Text File | 1995-04-11 | 141.2 KB | 3,188 lines |
- Archive-name: comp-speech-faq/part1
- Last-modified: 1995/01/11
-
-
- COMP.SPEECH FAQ POSTING - PART 1/3
-
-
- [Note: this document has been automatically extracted from a WWW site:
- http://www.speech.su.oz.au/comp.speech
- This may introduce some formatting errors.]
-
-
- Comp.Speech Frequently Asked Questions
-
- The Frequently Asked Questions (FAQ) is a regular posting to
- comp.speech which attempts to answer some of the regular questions in
- the comp.speech newsgroup.
-
- The FAQ is not meant to discuss any topic exhaustively. It will
- hopefully provide readers with pointers on where to find useful
- information, especially material available on the Internet.
-
- If you have not already read the Usenet introductory material posted
- to "news.announce.newusers", please do. For help with FTP (file
- transfer protocol) look for a regular posting of "Anonymous FTP List -
- FAQ" in comp.misc, comp.archives.admin or news.answers.
-
- This FAQ is posted every 4 weeks to comp.speech, comp.answers &
- news.answers.
-
- It is also available for anonymous ftp from the comp.speech archive
- site :
- * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/FAQ-complete
-
- Or from the news.answers ftp site (and its mirrors)
- * ftp://rtfm.mit.edu/pub/usenet/news.answers/comp-speech-faq/*
-
- Or on the World Wide Web
- * http://www.speech.su.oz.au/comp.speech
-
- Or by sending email to mail-server@rtfm.mit.edu with the following
- line in the body of the message:
- * send usenet/news.answers/comp-speech-faq/*
-
- Admin
-
- Not much to report this month. Hopefully, February should see some
- major catch-up work.
-
- FAQ Sections
-
- The FAQ is divided into the following sections:
- * FAQ Contents
-
- * List of Speech Technology Products and Software
-
- * FAQ Section 1: General Information on Speech Technology
- * FAQ Section 2: Signal Processing
- * FAQ Section 3: Speech Coding and Compression
- * FAQ Section 4: Natural Language Processing
- * FAQ Section 5: Speech Synthesis
- * FAQ Section 6: Speech Recognition
-
- Comp.Speech FTP Site
-
- The comp.speech ftp site (which is described in Q1.2) contains the
- following:
- * Newsgroup Archives
- * Data Resources
- * General Information
- * Software
-
- Acknowledgements
-
- Hundreds of people have made contributions to the comp.speech FAQ over
- the last two years; there are too many to name individually. Special
- thanks go to Tony Robinson and Joe Campbell who have been particularly
- helpful.
-
- Maintenance
-
- The FAQ posting and the Comp.Speech WWW Site are maintained by
-
- Andrew Hunt
- ---
- Speech Technology Research Group
- Dept. of Electrical Engineering
- University of Sydney, NSW, 2006, Australia
- Ph: 61-2-351 4509
- Fax: 61-2-351 3847
- email: andrewh@speech.su.oz.au
-
-
- ===========================================================================
-
-
- COMP.SPEECH FAQ CONTENTS
-
- Introduction
-
- * Overview
- * List of Packages
-
- Section 1 : General Information on Speech Technology
-
- * Q1.1 What is comp.speech?
- * Q1.2 Where are the comp.speech archives?
- * Q1.3 Common abbreviations and jargon.
- * Q1.4 What are related newsgroups and mailing lists?
- * Q1.5 What are related journals and conferences?
- * Q1.6 What resources are available as handicap aids?
- * Q1.7 What speech data is available?
- * Q1.8 Speech File Formats, Conversion and Playing.
- * Q1.9 What "Speech Laboratory Environments" are available?
- * Q1.10 Miscellaneous Software and Other Resources.
-
- Section 2 : Signal Processing for Speech
-
- * Q2.1 What sampling do I need for speech?
- * Q2.2 How do I find the pitch of a speech signal?
- * Q2.3 How do I find the start and end points of a speech signal?
- * Q2.4 Where can I find FFT software?
- * Q2.5 What signal processing techniques are used in speech
- technology?
- * Q2.6 What speech sampling and signal processing hardware can I
- use?
- * Q2.7 How do I convert to/from mu-law format?
-
- Section 3 : Speech Coding and Compression
-
- * Q3.1 Speech compression techniques.
- * Q3.2 What are some good references/books on coding/compression?
- * Q3.3 What software is available? (Includes CELP & G.7xx)
-
- Section 4 : Natural Language Processing
-
- * Q4.1 What are some good references/books on NLP?
- * Q4.2 What NLP software is available?
-
- Section 5 : Speech Synthesis
-
- * Q5.1 What is speech synthesis?
- * Q5.2 How can speech synthesis be performed?
- * Q5.3 What are some good references/books on synthesis?
- * Q5.4 What software/hardware is available?
-
- Section 6 : Speech Recognition
-
- * Q6.1 What is speech recognition?
- * Q6.2 How can I build a very simple speech recogniser?
- * Q6.3 What does speaker dependent/adaptive/independent mean?
- * Q6.4 What does small/medium/large/very-large vocabulary mean?
- * Q6.5 What does continuous speech or isolated-word mean?
- * Q6.6 How is speech recognition done?
- * Q6.7 What are some good references/books on recognition?
- * Q6.8 What speech recognition packages are available?
-
-
- ===========================================================================
-
-
- FAQ: List of Packages
-
- The comp.speech FAQ provides information on a range of software,
- hardware and resources.
-
- Speech Data
-
- * Phonemic Samples
- * Linguistic Data Consortium (LDC)
- * Center for Spoken Language Understanding (CSLU)
- * PhonDat - A Large Database of Spoken German
- * Oxford Acoustic Phonetic Database
-
- Speech Processing Environments
-
- * Entropic Signal Processing System (ESPS) and Waves
- * CSRE: Canadian Speech Research Environment
- * OGI Speech Tools
- * Matlab plus Signal Processing Toolbox
- * Signalyze 3.0 from InfoSignal
- * Kay Elemetrics CSL (Computer Speech Lab) 4300
- * MacSpeech Lab II (MSL II)
- * N!Power
- * Ptolemy
- * Khoros
- * SpeechViewer II
-
- Other Resources
-
- * CMU Dictionary
- * Another Dictionary
- * BEEP dictionary
- * CUVOLAD dictionary
- * MRC database
- * Network Audio System
- * NEVOT (1.4v) from AT&T BL
- * Human Audio Perception Document
- * Homophone List
- * Auditory Toolbox for Matlab
- * Auditory Modeller 1
- * Auditory Modeller 2
-
- Audio I/O Hardware
-
- * Sun standard audio port (SPARC I & II)
- * Sun standard audio port (SPARC 10 & 20)
- * Ariel Signal Processors
- * IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
- * Sound Galaxy NX , Aztech Systems
- * Sound Galaxy NX PRO, Aztech Systems
- * ATI Stereo F/X Sound Board
- * Various PC Sound Cards
-
- Compression Software and Hardware
-
- * File format conversion
- * shorten - a lossless compressor for speech signals
- * 32 kbps ADPCM
- * GSM 06.10 Compression
- * G.721/722/723 Compression
- * G.728 Compression
- * G.728 LD-CELP vocoder
- * U.S.F.S. 1016 CELP vocoder for DSP56001
- * 8 Kbit/s CELP on the TMS320C5x family of DSP chips
- * CELP 3.2a & LPC
-
- Natural Language Processing
-
- * Natural Language Software Registry (NLSR) - NLP Tools
- * Part of Speech Tagger
-
- Speech Synthesis
-
- * Orator Text-to-Speech Synthesizer
- * Text to phoneme program (1)
- * Text to phoneme program (2)
- * Text to phoneme program (3)
- * Text to speech program
- * "Speak" - a Text to Speech Program
- * TheBigMouth - a Text to Speech Program
- * TextToSpeech Kit
- * SGI Developers Toolbox Synthesiser
- * rsynth
- * SENSYN speech synthesizer
- * spchsyn.exe
- * CSRE: Canadian Speech Research Environment
- * Eloquence (currently an alpha release)
- * JSRU
- * Klatt-style synthesiser
- * DECTalk
- * Speech Manager and PlainTalk
- * Various Mac Speech Output Applications
- * MacinTalk
- * Monologue by Creative Labs
- * Lernout & Hauspie Text-To-Speech SDK
- * Tinytalk
- * Narrator - narrator.device
- * Infovox Product Range
- * SIMTEL-20
-
- Speech Recognition
-
- * HM2007 - Speech Recognition Chip
- * Voice Blaster Ver. 4.0
- * Votan
- * Entropic's HTK (HMM Toolkit)
- * DragonDictate version 3.0
- * DragonDictate for Windows
- * DragonVoiceTools
- * IBM Personal Dictation System
- * Osborne Personal Dictation System (in Australia)
- * VoiceServer for Windows
- * IN3 Voice Command for Windows
- * IN3 Voice Command
- * Phonetic Engine 400 (PE400) - Speech Systems, Inc.
- * SayIt
- * Kurzweil Voice for Windows 1.0
- * D6006 Voice Control Processor
- * Speech Commander - Listen for Windows
- * Voice-Trek 2.0
- * Visus SpeechKit
- * recnet
- * Lotec Speech Recognition Package
- * Myers' Hidden Markov Model software
- * Voice Command Line Interface
- * DATAVOX - French
- * PowerSecretary
- * ICSS system from IBM
- * Creative VoiceAssist
-
-
- ===========================================================================
-
-
- FAQ SECTION 1 - General
-
- Q1.1: WHAT IS COMP.SPEECH?
-
- Comp.speech is a newsgroup for discussion of speech technology and
- speech science. It covers a wide range of issues from application of
- speech technology, to research, to products and lots more. By nature
- speech technology is an inter-disciplinary field and the newsgroup
- reflects this. However, computer application is the basic theme of the
- group.
-
- The following is a list of topics but does not cover all matters
- related to the field (no order of importance is implied).
- * Speech Recognition - discussion of methodologies, training,
- techniques, results and applications. This should cover the
- application of techniques including HMMs, neural-nets and so on to
- the field.
-
- * Speech Synthesis - discussion concerning theoretical and
- practical issues associated with the design of speech synthesis
- systems.
-
- * Speech Coding and Compression - both research and application
- matters.
-
- * Phonetic/Linguistic Issues - coverage of linguistic and phonetic
- issues which are relevant to speech technology applications. Could
- cover parsing, natural language processing, phonology and prosodic
- work.
-
- * Speech System Design - issues relating to the application of
- speech technology to real-world problems. Includes the design of
- user interfaces, the building of real-time systems and so on.
-
- * Other matters - relevant conferences, jobs, books, software,
- hardware, and products.
-
- _________________________________________________________________
-
- Q1.2: WHERE ARE THE COMP.SPEECH ARCHIVES?
-
- comp.speech is being archived for anonymous ftp.
- * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/archive/
-
- comp.speech/archive contains the articles as they arrive. Batches of
- 100 articles are grouped into a shar file, along with an associated
- file of Subject lines.
-
- Other useful information is also available in comp.speech/info.
- _________________________________________________________________
-
- Q1.3: COMMON ABBREVIATIONS AND JARGON.
- * ANN - Artificial Neural Network.
- * ASR - Automatic Speech Recognition.
- * ASSP - Acoustics Speech and Signal Processing
- * AVIOS - American Voice I/O Society
- * CELP - Code-book Excited Linear Prediction.
- * COLING - Computational Linguistics
- * DTW - Dynamic Time Warping.
- * FAQ - Frequently Asked Questions.
- * HMM - Hidden Markov Model.
- * IEEE - Institute of Electrical and Electronics Engineers
- * JASA - Journal of the Acoustical Society of America
- * LPC - Linear Predictive Coding.
- * LVQ - Learning Vector Quantisation.
- * NLP - Natural Language Processing.
- * NN - Neural Network.
- * TI - Texas Instruments.
- * TIMIT - A large speech corpus from TI and MIT - see Q1.7
- * TTS - Text-To-Speech (i.e. synthesis).
- * VQ - Vector Quantisation.
-
- _________________________________________________________________
-
- Q1.4: WHAT ARE RELATED NEWSGROUPS AND MAILING LISTS?
-
- Newsgroups
-
- comp.ai - Artificial Intelligence newsgroup.
- Postings on general AI issues, language processing and AI
- techniques. Has a good FAQ including NLP, NN and other AI
- information.
-
- comp.ai.nat-lang - Natural Language Processing Group
- Postings regarding Natural Language Processing. Set up to cover
- a broad range of related issues and different viewpoints.
-
- comp.ai.nlang-know-rep - Natural Language Knowledge Representation
- Moderated group covering Natural Language.
-
- comp.ai.neural-nets - discussion of Neural Networks and related
- issues.
- There are often postings on speech-related matters - phonetic
- recognition, connectionist grammars and so on.
-
- comp.compression - occasional articles on compression of speech.
- FAQ for comp.compression has some info on audio compression
- standards.
-
- comp.dcom.telecom - Telecommunications newsgroup.
- Has occasional articles on voice products.
-
- comp.dsp - discussion of signal processing - hardware and algorithms
- and more.
- Has a good FAQ posting. Has a regular posting of a
- comprehensive list of Audio File Formats.
-
- comp.multimedia - Multi-Media discussion group.
- Has occasional articles on voice I/O.
-
- sci.lang - Language.
- Discussion about phonetics, phonology, grammar, etymology and
- lots more.
-
- alt.sci.physics.acoustics
- Some discussion of speech production & perception.
-
- alt.binaries.sounds.misc - posting of various sound samples
-
- alt.binaries.sounds.d - discussion about sound samples, recording
- and playback.
-
- Mailing Lists
-
- ECTL - Electronic Communal Temporal Lobe
- Founder & Moderator: David Leip. Moderated mailing list for
- researchers with interests in computer speech interfaces. This
- list serves a broad community including persons from signal
- processing, AI, linguistics and human factors. To subscribe,
- send your name, institute, department, daytime phone and email
- address to:
-
- + ectl-request@snowhite.cis.uoguelph.ca
-
- The ECTL archive site is
-
- + ftp://snowhite.cis.uoguelph.ca/pub/ectl
-
- Prosody Mailing List
- Unmoderated mailing list for discussion of prosody. The aim is
- to facilitate the spread of information relating to the
- research of prosody by creating a network of researchers in the
- field. If you want to participate, send the following one-line
- message to
-
- + listserv@msu.edu
- + subscribe prosody Your Name
-
- foNETiks
- A moderated monthly newsletter distributed by e-mail. It
- carries job advertisements, notices of conferences, and other
- news of general interest to phoneticians, speech scientists and
- others. The editors are Linda Shockey and Gerry Docherty. To
- subscribe, send the following one-line message to
-
- + mailbase@mailbase.ac.uk
- + join fonetiks your_first_name your_second_name
-
- Digital Mobile Radio
- Covers many areas, including some speech topics such as speech
- coding and speech compression. To subscribe, mail Peter Decker
- at dec@dfv.rwth-aachen.de.
-
- _________________________________________________________________
-
- Q1.5: WHAT ARE RELATED JOURNALS AND CONFERENCES?
-
- Try the following commercially oriented magazines:
- * Voice News - monthly industry newsletter
- Stoneridge Technical Services
- PO Box 1891, Rockville, MD, 20850, USA
- Phone: (301) 424-0114
- * Voice Technology News
- * Voice Processing Magazine (1-800-854-3112)
- * Speech Technology (no longer published)
-
- Try the following technical journals (some contact addresses below):-
- * IEEE Transactions on Speech and Audio Processing (from Jan 93)
- * IEEE Signal Processing Magazine (from Jan 93)
- * IEEE Transactions on Acoustics, Speech, and Signal Processing
- (ASSP) (now obsolete)
- * Computational Linguistics (COLING)
- * Computer Speech and Language
- * Journal of the Acoustical Society of America (JASA)
- * AVIOS Journal
- * ASR News
-
- Try the following conferences:-
- * ICASSP Intl. Conference on Acoustics Speech and Signal Processing
- (IEEE)
- * ICSLP Intl. Conference on Spoken Language Processing
- * EUROSPEECH European Conference on Speech Communication and
- Technology
- * AVIOS American Voice I/O Society Conference
- * SST Australian Speech Science and Technology Conference
-
- Here are a few contact addresses:-
-
- Publications:
- IEEE Transactions on Speech and Audio Processing (from Jan 93)
- IEEE Transactions on Acoustics, Speech, and Signal Processing
- (ASSP) - now obsolete.
-
- Organization:
- Institute of Electrical and Electronics Engineers (IEEE)
-
- Contact:
- IEEE Service Center
- 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855, USA
- Phone: 1-800-678-IEEE or (201)981-0060
-
- Publications:
- Computer Speech and Language
-
- Contact:
- Academic Press, Ltd.
- 24-28 Oval Rd, London NW1, England
-
- Price:
- $136 (Institutions), $58 (Individuals)
-
- Publications:
- Computational Linguistics
-
- Organization:
- Association for Computational Linguistics
- MIT Press Journals
- 55 Hayward St, Cambridge, MA 02142, USA
- Phone: (617)253-2889
-
- _________________________________________________________________
-
- Q1.6: WHAT RESOURCES ARE AVAILABLE AS HANDICAP AIDS?
-
- Can anyone provide information on speech technology aids for the deaf,
- blind, speech impaired, physically impaired and other groups who may
- benefit from speech technology?
-
- SpeechViewer II
- * Platform: IBM Machines from Mod 25 on.
- * Description: SpeechViewer II is a speech therapy tool. It
- provides graphical feedback on various speech features so that
- speech impaired individuals can improve their speech. It works
- with an audio bandwidth of 7.3 kHz and thus allows the therapist
- to work with sustained vowels and fricatives. A wide range of
- graphics is used to provide adequate variability to hold client
- interest. An extensive set of statistics is gathered, which allows
- a therapist to do research or keep therapy records. The speech
- therapy modules are:
- + Awareness - Sound, Loudness, Pitch, Voicing Onset, Voicing
- + Skill Building - Pitch, Voicing, Phonology
- + Patterning - Pitch & Loudness - Waveform & Spectrogram,
- Spectra
- + Clinical Management - Profiles, Models, Client Data
- * Hardware: Requires an IBM M-ACPA (Multimedia-Audio Capture
- Playback Adapter). It has a TI TMS320C25 DSP chip. The input
- sampling rate is 44.1 kHz stereo, 88.2 kHz mono. This is a 16-bit
- card. It has the following jacks: mic in, stereo line in, stereo
- line out, speaker out. Note: This card is being replaced by Mwave
- technology. For more info on Mwave contact Texas Instruments.
- * Price:
- + The software is $2130 list, $1491 educational, part number
- 92F2066.
- + The M-ACPA is $370 list, $222 educational, part number
- 92F3378.
- + The MicroChannel adapter part number is 92F3379 (same price).
- * Contact: The Psychological Corporation (TPC) [IBM Authorized
- Remarketer]
- Phone: 1-800-228-0752 or contact IBM on 1-800-426-4832.
-
- _________________________________________________________________
-
- Q1.7: WHAT SPEECH DATA IS AVAILABLE?
-
- A wide range of speech databases have been collected. These databases
- are primarily for the development of speech synthesis/recognition and
- for linguistic research.
-
- Some databases are free but most appear to be available for a small
- cost. The databases normally require lots of storage space - do not
- expect to be able to ftp all the data you want.
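-
- As a rough guide to those storage needs, the size of uncompressed
- speech can be estimated from its sampling rate and sample width. The
- sketch below is only an illustration: the function name pcm_bytes is
- made up for this example, and it assumes raw linear PCM with no file
- header and no compression. The default of 16 kHz, 16-bit mono matches
- the recording quality quoted for some of the corpora listed below.
-
```python
def pcm_bytes(seconds, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Return the size in bytes of uncompressed linear PCM audio.

    Hypothetical helper for illustration; it ignores file headers and
    any compression a particular corpus may apply.
    """
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# One second of 16 kHz, 16-bit mono speech.
per_second = pcm_bytes(1)

# One hour of the same material, in megabytes (10**6 bytes).
per_hour_mb = pcm_bytes(3600) / 1e6

print(per_second, per_hour_mb)
```
-
- Under these assumptions one second of speech takes 32,000 bytes and
- one hour takes a little over 115 megabytes, which is why the larger
- corpora are distributed on CD-ROM rather than by ftp.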
-
- Phonemic Samples
- * First, some basic data. The following ftp sites have samples of
- English phonemes (American accent I believe) in Sun audio format
- files. See Question 1.8 for information on audio file formats.
- + ftp://sounds.sdsu.edu/.1/phonemes: This ftp site appears to
- be obsolete. Does anyone know a new address?
- + ftp://phloem.uoregon.edu/pub/Sun4/lib/phonemes : There
- appears to be some config problem with this ftp server.
- + ftp://sunsite.unc.edu/pub/multimedia/sun-sounds/phonemes
-
- Linguistic Data Consortium (LDC)
- * Briefly stated, the LDC has been established to broaden the
- collection and distribution of speech and natural language data
- bases for the purposes of research and technology development in
- automatic speech recognition, natural language processing and
- other areas where large amounts of linguistic data are needed.
- Here is a list of some of the corpora:
- + The TIMIT and NTIMIT speech corpora
- + The Resource Management speech corpus (RM1, RM2)
- + The Air Travel Information System (ATIS0) speech corpus
- + The Association for Computational Linguistics - Data
- Collection Initiative text corpus (ACL-DCI)
- + The TI Connected Digits speech corpus (TIDIGITS)
- + The TI 46-word Isolated Word speech corpus (TI-46)
- + The Road Rally conversational speech corpora (including
- "Stonehenge" and "Waterloo" corpora)
- + The Tipster Information Retrieval Test Collection
- + The Switchboard speech corpus ("Credit Card" excerpts and
- portions of the complete Switchboard collection)
- * Further resources made available in the first year (or two):
- + The Machine-Readable Spoken English speech corpus (MARSEC)
- + The Edinburgh Map Task speech corpus
- + The Message Understanding Conference (MUC) text corpus of FBI
- terrorist reports
- + The Continuous Speech Recognition - Wall Street Journal
- speech corpus (WSJ-CSR)
- + The Penn Treebank parsed/tagged text corpus
- + The Multi-site ATIS speech corpus (ATIS2)
- + The Air Traffic Control (ATC) speech corpus
- + The Hansard English/French parallel text corpus
- + The European Corpus Initiative multi-language text corpus
- (ECI)
- + The Int'l Labor Organization/Int'l Trade Union multi-language
- text corpus (ILO/ITU)
- + Machine-readable dictionaries/lexical data bases (COMLEX,
- CELEX)
- * Detailed information about the Linguistic Data Consortium is
- available by anonymous ftp from the address below. The files in the
- directory include more detailed information on the individual
- databases.
- + ftp://ftp.cis.upenn.edu/pub/ldc
- * For further information contact
- Linguistic Data Consortium
- 441 Williams Hall, University of Pennsylvania
- Philadelphia, PA 19104-6305
- Phone: +1 (215) 898-0464
- Fax: +1 (215) 573-2175
- e-mail: ldc@unagi.cis.upenn.edu
-
- Center for Spoken Language Understanding (CSLU)
- * The ISOLET speech database of spoken letters of the English
- alphabet. The speech is high quality (16 kHz with a noise
- cancelling microphone). 150 speakers x 26 letters of the English
- alphabet twice in random order. The ISOLET data base can be
- purchased for $100 by sending an email request to
- vincew@cse.ogi.edu. (This covers handling, shipping and medium
- costs). The data base comes with a technical report describing the
- data.
- * CSLU has a telephone speech corpus of 1000 recitations of the
- English alphabet. Callers recite the alphabet with brief pauses
- between letters.
- This database is available to not-for-profit institutions for
- $100. The data base is described in the proceedings of the
- International Conference on Spoken Language Processing.
- + Contact vincew@cse.ogi.edu if interested.
- * CSLU has released for universities its Continuous English Speech
- Corpus. The corpus contains recorded speech from 690 different
- speakers, with label files at various levels - including word
- level and phonetic labels. The data were collected as part of the
- OGI Multi-language telephone corpus. CSLU provides speech corpora
- to all universities without charge. To order a corpus, print the
- license agreement/order form, complete it, and fax it to the CSLU.
- A description of the corpora and an order form are available by
- anonymous ftp:
- + ftp://speech.cse.ogi.edu/pub/releases
- * Contact: Mike Noel -
- email: noel@cse.ogi.edu Phone: (503) 690-1309
-
- PhonDat - A Large Database of Spoken German
- * The PhonDat continuous speech corpora are now available on CD-ROM
- media (ISO 9660 format).
- + PhonDat I (Diphone Corpus) : 6 CDs (1140.- DM)
- + PhonDat II (Train Enquiries Corpus): 1 CD ( 190.- DM)
- * PhonDat I comprises approx. 20,000, PhonDat II approx. 1500 signal
- files in high quality 16-bit 16 kHz recordings. The corpora come
- with documentation containing the orthographic transcription and a
- citation form of the utterances, as well as a detailed file format
- description. A narrow phonetic transcription is available for
- selected files from corpus I and II.
- * For information and orders contact
- Barbara Eisen
- Institut fuer Phonetik
- Schellingstr. 3 / II
- D 80799 Munich 40
- Tel: +49 / 89 / 2180 -2454 or -2758
- Fax: +49 / 89 / 280 03 62
-
- Oxford Acoustic Phonetic Database
- * Available on compact disc, from J. Pickering and B. Rosner. It
- contains data on vowel-consonant and consonant-vowel combinations
- in both stressed and unstressed locations. The languages covered
- include French, German, Hungarian, Italian, Japanese, British
- English, Spanish and English. For further information write to
- Electronic Publishing, Oxford University
- Press, Walton Street, Oxford OX2 6DP, UK.
- The ISBN is 0-19-268086-2
- * Contact:
- Prof. B. Rosner
- Dept. of Experimental Psychology
- South Parks Rd, Oxford, OX1 3UD, UK
- email: burton.rosner@wolfson.ox.ac.uk
-
- _________________________________________________________________
-
- Q1.8: SPEECH FILE FORMATS, CONVERSION AND PLAYING.
-
- Section 2 of this FAQ has information on mu-law coding.
-
- A very good and very comprehensive list of audio file formats is
- prepared by Guido van Rossum. The list is posted regularly to comp.dsp
- and alt.binaries.sounds.misc, amongst others. It includes information
- on sampling rates, hardware, compression techniques, file format
- definitions, format conversion, standards, programming hints and lots
- more. It is also available by ftp from
- * ftp://ftp.cwi.nl/pub/audio/AudioFormats.part1,2
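-
- As an illustration of what such a file format definition looks like,
- the Sun audio format used by the phoneme samples in Q1.7 starts with
- a fixed header of six big-endian 32-bit integers: a magic number
- (the bytes ".snd"), the offset of the sample data, the data size, an
- encoding code, the sampling rate and the channel count. The sketch
- below decodes that header. The function name parse_au_header and the
- short encoding table are assumptions made for this example, and the
- table lists only a few of the defined codes; consult the Audio File
- Formats list for the full definition.
-
```python
import struct

AU_MAGIC = 0x2E736E64  # ASCII ".snd", the first four bytes of a Sun audio file

# A few common encoding codes; the format defines more than are listed here.
AU_ENCODINGS = {
    1: "8-bit mu-law",
    2: "8-bit linear PCM",
    3: "16-bit linear PCM",
}


def parse_au_header(data):
    """Decode the fixed 24-byte header at the start of a Sun audio file.

    Hypothetical helper for illustration only.
    """
    if len(data) < 24:
        raise ValueError("need at least 24 bytes of header")
    # Six big-endian unsigned 32-bit integers.
    magic, data_offset, data_size, encoding, rate, channels = struct.unpack(
        ">6I", data[:24]
    )
    if magic != AU_MAGIC:
        raise ValueError("not a Sun audio file (bad magic number)")
    return {
        "data_offset": data_offset,
        "data_size": data_size,  # 0xFFFFFFFF means the size is unknown
        "encoding": AU_ENCODINGS.get(encoding, "code %d" % encoding),
        "sample_rate": rate,
        "channels": channels,
    }


# Synthetic header for demonstration: 8 kHz mono mu-law, data size unknown.
example = struct.pack(">6I", AU_MAGIC, 24, 0xFFFFFFFF, 1, 8000, 1)
info = parse_au_header(example)
print(info["encoding"], info["sample_rate"], info["channels"])
```
-
- To inspect a real file, read its first 24 bytes in binary mode and
- pass them to the function; the sample data begins at the returned
- data offset.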
-
- _________________________________________________________________
-
- Q1.9: WHAT "SPEECH LABORATORY ENVIRONMENTS" ARE AVAILABLE?
-
- First, what is a Speech Laboratory Environment? A speech lab is a
- software package which provides the capability of recording, playing,
- analysing, processing, displaying and storing speech. Your computer
- will require audio input/output capability. The different packages
- vary greatly in features and capability - best to know what you want
- before you start looking around.
-
- Most general purpose audio processing packages will be able to process
- speech but do not necessarily have the specialised capabilities
- needed for speech (e.g. formant analysis).
-
- The following article provides a good survey.
- * Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An
- Evaluation" Journal of Speech and Hearing Research, pp 314-332,
- April 1992.
-
- Entropic Signal Processing System (ESPS) and Waves
- * Platform: Range of Unix platforms.
- * Description: ESPS is a comprehensive set of speech
- analysis/processing tools for the UNIX environment. The package
- includes UNIX commands, and a comprehensive C library (which can
- be accessed from other languages). Waves is a graphical front-end
- for speech processing. Speech waveforms, spectrograms, pitch
- traces etc can be displayed, edited and processed in X windows and
- Openwindows (versions 2 & 3). Waves also includes a signal
- labelling utility which provides multiple feature labelling and
- useful features for fast labelling of large speech databases.
- Entropic also distributes HTK (the Hidden Markov Model Toolkit).
- HTK is described in Section 6 of this FAQ.
- * Cost: On request.
- * Contact:
- Entropic Research Laboratory, Washington Research Laboratory
- 600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
- (202) 547-1420
- email - info@entropic.com
-
- CSRE: Canadian Speech Research Environment
- * Platform: IBM/AT-compatibles
- * Description: CSRE is a microcomputer-based system designed to
- support speech research. CSRE provides a low-cost facility in
- support of speech research, using mass-produced and
- widely-available hardware. The project is non-profit, and relies
- on the cooperation of researchers at a number of institutions and
- fees generated when the software is distributed. Functions include
- speech capture, editing, and replay; several alternative spectral
- analysis procedures, with color and surface/3D displays; parameter
- extraction/ tracking and tools to automate measurement and support
- data logging; alternative pitch-extraction systems; parametric
- speech (KLATT80) and non-speech acoustic synthesis, with a variety
- of supporting productivity tools; and an experiment generator, to
- support behavioral testing using a variety of common testing
- protocols. A paper about the whole package can be found in:
- + Jamieson D.G. et al, "CSRE: A Speech Research Environment",
- Proc. of the Second Intl. Conf. on Spoken Language
- Processing, Edmonton: University of Alberta, pp. 1127-1130.
- * Hardware: Can use a range of data acquisition/DSP hardware
- * Cost: Distributed on a cost recovery basis.
- * Availability: For more information on availability contact
- Krystyna Marciniak
- email march@uwovax.uwo.ca
- Tel (519) 661-3901 Fax (519) 661-3805.
- For technical information
- email ramji@uwovax.uwo.ca
- * Note: Also included in Q5.4 on speech synthesis packages.
-
- OGI Speech Tools
- * Developers from the Center for Spoken Language Understanding
- (CSLU) at the Oregon Graduate Institute of Science and Technology
- (Portland Oregon)
- * Platform: Unix
- * Description: The OGI Speech tools include :
- + An X windows display tool (LYRE) for displaying data in a
- time synchronous fashion for a. the speech signal b.
- spectrograms c. phoneme labels, and other information.
- + A Neural Network (NOPT) training package.
- + A set of C library routines (LIBNSPEECH) for the
- manipulation of speech data, including: a. PLP Analysis, b.
- Rasta PLP Analysis, c. Linear Predictive Coding, d. Mel
- Cepstrum Coding, e. Fast Fourier Transform
- + A set of utilities for converting file formats such as ADC,
- NIST, mu-law, binary files, and ascii. Includes filtering.
- + A database utility (find_phone) to automate speech database
- related enquiries. It allows the user to specify a particular
- label or set of labels in a given context, display all
- occurrences of the label, and relabel the occurrences if
- desired.
- + A Vector-Quantizer based on the Linde Buzo and Gray (LBG)
- algorithm.
- + A set of PERL Scripts which have been used mainly to automate
- the use of the OGI Speech Tools.
- + MAN Pages for all routines and programs developed, as well as
- a User manual in both PostScript and TeX format.
- * Misc: Software is written in ANSI C.
- * Availability: By anonymous ftp from
- + ftp://speech.cse.ogi.edu/pub/tools/
- * Contact: Try tools@cse.ogi.edu
-
- Matlab plus Signal Processing Toolbox
- * Platform: Wide range
- * Description: Matlab (MATrix LABoratory) is a technical computing
- environment for numerical computation and visualization based on a
- matrix oriented, interpreted programming language. The programming
- environment provides support for the development of customized
- operations, along with debugging facilities and a graphical user
- interface toolkit. Audio output is provided.
-
- A specialised Signal Processing Toolbox is available which
- provides many functions which are useful for speech analysis. It
- includes filter design, spectral estimation, statistical signal
- processing, waveform generation, and signal and spectrogram
- display.
-
- A specialised Auditory Toolbox is available which contains
- functions useful to people interested in auditory/cochlear models.
- A more detailed description is given in Q1.10.
- * Price: On request.
- * Contact: The Math Works Inc.
- 24 Prime Park Way, Natick, MA 01760-1500 USA
- Ph: 1-508-653 1415 Fax: 1-508-653 6284
- Email: info@mathworks.com
- * FTP: ftp://ftp.mathworks.com
- * WWW: http://www.mathworks.com/
-
- Signalyze 3.0 from InfoSignal
- * Platform: Macintosh
- * Description: Signalyze's basic conception revolves around up to
- 100 signals, displayed synchronously in HyperCard fashion on
- "cards". The program offers a complement of signal editing
- features, quite a few spectral analysis tools, manual scoring
- tools, pitch extraction routines, a good set of signal
- manipulation tools, and extensive input-output capacity.
-
- Handles multiple file formats: Signalyze, MacSpeech Lab,
- AudioMedia, SoundDesigner II, SoundEdit/MacRecorder, SoundWave,
- three sound resource formats, and ASCII-text. Sound I/O: Direct
- sound input from MacRecorder and similar devices, AudioMedia,
- AudioMedia II and AD IN, some MacADIOS boards and devices, Apple
- sound input (built-in microphone). Sound output via Macintosh
- internal sound, via SoundManager 3.0, some MacADIOS boards and
- devices as well as via the Digidesign 16-bit boards.
-
- It has a range of capabilities for creating, editing and
- manipulating label files with flexibility in labelling format.
- * Compatibility: MacPlus and higher (including II, IIx, IIcx,
- IIci, IIfx, IIvx, IIvi, Portable, all PowerBooks, Centris and
- Quadras). Takes advantage of large and multiple screens and 16/256
- color/grayscales. System 7.0 compatible. Runs in background with
- adjustable priority.
- * Misc: A demo is available upon request. Manuals and tutorial
- included. It is available in English, French, and German. An
- UPDATER to version 2.48 is now available from:
- + The UNIL Gopher server (see last page of InfoSignal News 8)
- + The LAIP FTP server. Address: MACFL4082.unil.ch, machine
- no. 130.223.104.31
- Also available are a demo program, and current questions and answers.
- * Cost: Individual license US$350, site license US$500, plus
- shipping. Upgrades from version 2.0 are available.
- * Contact:
- North America - Network Technology Corporation
- 91 Baldwin St., Charlestown MA 02129
- Fax: 617-241-5064 Phone: 617-241-9205
- Elsewhere contact
- InfoSignal Inc.
- C.P. 73, 1015 LAUSANNE, Switzerland,
- FAX: +41 21 691-1372,
- Email: 76357.1213@COMPUSERVE.COM.
-
- Kay Elemetrics CSL (Computer Speech Lab) 4300
- * Platform: Minimum IBM PC-AT compatible with extended memory (min
- 2MB) with at least VGA graphics. Optimal would be 386 or 486
- machine with more RAM for handling larger amounts of data.
- * Description: Speech analysis package, with optional separate LPC
- program for analysis/synthesis. Uses its own file format for data,
- but has some ability to export data as ASCII. The main
- editing/analysis program (but not the LPC part) has its own macro
- language, making it easy to perform repetitive tasks. Probably not
- much use without the extra LPC program, which also allows
- manipulation of pitch, formant and bandwidth parameters.
-
- Hardware includes an internal DSP board for the PC (requires ISA
- slot), and an external module containing signal processing chips
- which does A/D and D/A conversion.
- * Misc: A programmer's kit is available for programming signal
- processing chips (experts only). A speaker and microphone are
- supplied. Manuals are included.
- * Cost: Recently approx 6000 pounds sterling.
- * Contact:
- UK distributors are Wessex Electronics,
- 114-116 North Street, Downend, Bristol, B16 5SE
- Tel: 0272 571404.
- In the USA contact:
- Kay Elemetrics Corp,
- 12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798
- Tel:(201) 227-7760
-
- MacSpeech Lab II (MSL II)
- * Platform: Macintosh
- * Description: A sound analysis and acquisition package for Macs.
- MSL II delivers the most common functions for speech analysis
- (FFTs, LPCs, f0 extraction, etc.) & produces grayscale
- spectrographic displays. Can be used for various speech technology
- and phonetic training tasks. The software can trade off accuracy
- and speed.
- * Hardware: Requires MacADIOS ("Macintosh Analog/Digital
- Input/Output System") hardware for speech I/O at 12/16 bits.
- * Misc: Software no longer updated by GW Instruments; MSL
- soft/hardware will not perform input/output on Quadras, for
- example, though analysis seems fine. Known to operate properly on
- systems as high as IIcx & IIfx.
- * Cost: $4990 (in May '92 price list; no MSL soft/hardware package
- listed in January '93).
- * Contact:
- GW Instruments
- 35 Medford Street, Somerville, MA 02143
- Phone: (617) 625-4096 Fax: (617) 625-1322
-
- N!Power
- * Platform: SUN, DEC and HP workstations.
- * Description: An object-oriented software package with a MOTIF
- GUI interface and a range of functionality for data
- analysis/editing, signal analysis, speech processing, real-time
- A/D and D/A, and 2D/3D interactive graphics. N!Power replaces ILS.
-
- N!Power can provide a Block Diagram user interface, menus,
- pop-ups, and a high-level IEEE standard symbolic scripting
- language. You can customize the blocks, menus and pop-ups with
- mouse point-and-click operations.
- * Contact:
- Signal Technology, Inc.
- 104 W. Anapamu, Suite J, Santa Barbara, CA 93101-3126
- Phone: 805-899-8300 FAX: 805-899-4344
- email: larry@signal.com
-
- Ptolemy
- * Platform: Sun SPARC, DecStation (MIPS), HP (hppa).
- * Description: Ptolemy provides a highly flexible foundation for
- the specification, simulation, and rapid prototyping of systems.
- It is an object oriented framework within which diverse models of
- computation can co-exist and interact. Ptolemy can be used to
- model entire systems.
-
- Ptolemy has been used for a broad range of applications including
- signal processing, telecommunications, parallel processing,
- wireless communications, network design, radio astronomy, real
- time systems, and hardware/software co-design. Ptolemy has also
- been used as a lab for signal processing and communications
- courses. Ptolemy has been developed at UC Berkeley over the past 3
- years. Further information, including papers and the complete
- release notes, is available from the FTP site.
- * Cost: Free
- * Availability: The source code, binaries, and documentation are
- available by anonymous ftp from
- + ftp://ptolemy.berkeley.edu/pub/README
-
- Khoros
- * Description: Public domain image processing package with a basic
- DSP library. Not particularly applicable to speech, but not bad
- for the price.
- * Cost: Free
- * Availability: By anonymous ftp from ftp://pprg.eece.unm.edu
-
- SpeechViewer II
- * Description: Speech Therapy Tool. See the detailed description
- in the handicap section - Q1.6.
-
- _________________________________________________________________
-
- Q1.10: MISCELLANEOUS SOFTWARE AND OTHER RESOURCES.
-
- CMU dictionary
- * Description: Phonemic transcriptions of 100,000 words with
- American English pronunciation.
- * Availability: By anonymous ftp from the directory
- + ftp://ftp.cs.cmu.edu/project/fgdata/dict
- with the files README, cmudict.0.2.Z, cmulex.0.1.Z, phoneset.0.1
-
- Dictionary
- * Description: A comprehensive word list which should contain most
- common American words, abbreviations, hyphenations, and even
- incorrect spellings. The word lists were compiled from a number of
- sources: commercial news services, UseNet news postings, existing
- dictionaries, name lists, company lists, UNIX man pages, project
- Gutenberg's E-texts, project Wordnet, received mailings, etc. The
- current size is 460,000 words.
- * Availability: By anonymous ftp from
- + ftp://wocket.vantage.gte.com:/pub/standard_dictionary
-
- Note 1: There seems to be some sort of network problem reaching
- the server.
- Note 2: There is a README file which explains the file formats.
-
- BEEP dictionary
- * Description: Phonemic transcriptions of 100,000 English words.
- (British English pronunciations)
- * Availability: By anonymous ftp from the file
- + svr-ftp.eng.cam.ac.uk/comp.speech/data/beep-0.3.tar.Z
-
- CUVOLAD dictionary
- * Description: Computer Usable Version of the Oxford Advanced
- Learner's Dictionary. Has British English pronunciations and parts
- of speech.
- * Availability: By anonymous ftp from the directory
- + ftp://black.ox.ac.uk/ota/dicts/710
-
- MRC database
- * Description: The Medical Research Council Psycholinguistic
- Database. Has British English pronunciations, parts of speech, word
- frequency and lots of other information.
- * Availability: By anonymous ftp from the directory
- + ftp://black.ox.ac.uk/ota/dicts/1054
-
- Network Audio System Release 1.1
- * Platforms: Various (includes SunOS, Solaris, SGI)
- * Description: A device-independent mechanism for transferring,
- playing and recording audio signals over a network. Has a range of
- features suited to networks.
- * Cost: Free
- * Availability: By anonymous ftp from
- + ftp://ftp.x.org:/contrib/audio/nas/netaudio-1.2.tar.gz
- Also available in the same directory are document files and some
- sample sounds.
-
- AF version AF3R1
- * Platforms: DEC workstations (Alpha and MIPS), SparcStation, SGI
- * Description: The AF System is a device-independent
- network-transparent system including client applications and audio
- servers. With AF, multiple audio applications can run
- simultaneously, sharing access to the actual audio hardware.
-
- The AF3R1 distribution of AF includes server support for Digital
- RISC systems running Ultrix, Digital Alpha AXP systems running
- OSF/1, SGI Indigo running IRIX 4.0.5, Sun Microsystems
- SPARCstations running SunOS 4.1.3, and Sun Microsystems
- SPARCstations running Solaris 2.3. The servers support audio
- hardware ranging from the built-in CODEC audio on SPARCstations
- and Personal DECstations to 48 kHz stereo audio using the DECaudio
- TURBOchannel module or the SPARCstation DBRI interface.
- * Availability: The source kit is distributed by anonymous ftp
- from
- + ftp://crl.dec.com/pub/DEC/AF
- * Contact: af-request@crl.dec.com
- + http://www.research.digital.com/CRL/projects/AF/home.html
-
- NEVOT (1.4v) from AT&T BL
- * Platforms: Sun Sparc Station (SunOS 4.1.x) and Silicon Graphics
- * Description: Audio-conferencing tool which supports both
- point-to-point and broadcasting of audio using multicast IP. Audio
- encoding:
- + PCM 64 kb/s 8-bit u-law encoded 8 kHz PCM (G.711)
- + ADPCM 32 kb/s [Sun only] (G.721)
- + DVI ADPCM 32 kb/s
- + ADPCM 24 kb/s [Sun only] (G.723)
- + CELP 4.8 kb/s
- + LPC 2.4 kb/s
- Source is available.
- * Availability: by anonymous ftp from
- + ftp://gaia.cs.umass.edu/pub/hgschulz/nevot
- * Contact: Henning Schulzrinne (hgs@research.att.com)
-
- Human Audio Perception Document
- * Description: Document prepared by Argiris Kranidiotis on the
- human audio perception system. It lists a number of references,
- gives plenty of numbers and some equations.
- * Availability: by anonymous ftp from the comp.speech archive site
- + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/info/HumanAudioPerception
- * Contact:
- Argiris A. Kranidiotis
- University Of Athens, Informatics Department
- email: akra@zeus.di.uoa.ariadne-t.gr
-
- Homophone List
- * A list of homophones in General American English is available by
- anonymous FTP from the comp.speech archive site:
- + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/data/homophones-1.01.txt
-
- Auditory Toolbox for Matlab
- * Description: This toolbox provides extensions to Matlab which
- are useful to people interested in auditory/cochlear modeling.
- [Matlab is described in the previous section.] This toolbox has
- been tested on both Macintosh and Unix computers. It includes the
- following major models:
- + Lyon's Passive Long Wave Cochlear Model (our conventional
- model)
- + Patterson-Holdsworth ERB Filter bank with Meddis Hair cell
- + Seneff's Auditory Model (Stages I and II)
- + MFCC (Mel-scale frequency cepstral coefficients from the ASR
- world)
- + Spectrogram
- + Correlogram generation and pitch modeling
- + Simple vowel synthesis
- * Availability: By anonymous FTP from the following site:
- + ftp://ftp.apple.com/pub/malcolm
- The following files are available:
- + 419487 AuditoryToolbox.mif.Z
- + 1372976 AuditoryToolbox.psc.Z
- + 573215 AuditoryToolbox.sea.hqx
- + 92160 AuditoryToolbox.tar
- + 36405 AuditoryToolbox.tar.Z
- The ".mif.Z" file is a Unix compressed version of the FrameMaker
- documentation. The ".psc.Z" file is a Unix compressed version of
- the Postscript documentation. The ".tar" and ".tar.Z" files are
- Unix TAR archives containing all of the m-functions and C-MEX
- source code. Finally, the ".sea.hqx" file is a Macintosh
- self-extracting archive that has been encoded using BinHex. We do
- provide precompiled versions of the three MEX functions for the
- Macintosh.
- * Misc: Our lawyers ask us to remind you that there is no
- warranty. We've done some testing but we undoubtedly missed
- things.
- * Contact:
- Malcolm Slaney: Interval Research.
- Email: malcolm@interval.com
-
- Auditory Modeller 1
- * Description: John Holdsworth's implementation of a gammatone
- filter bank and Roy Patterson's spiral model, in C (with X-window
- display).
- * Availability: By anonymous ftp from
- + ftp://ftp.mrc-apu.cam.ac.uk/pub/aim
-
- Auditory Modeller 2
- * Description: Lowel O'Mard's implementation of peripheral
- filtering, Ray Meddis's hair cell model and other stuff in C (as a
- library of routines).
- * Availability: By anonymous ftp from
- + ftp://suna.lut.ac.uk/public/hulpo/lutear
-
- _________________________________________________________________
-
-
-
-
- Andrew Hunt
- ---
- Speech Technology Research Group Ph: 61-2-351 4509
- Dept. of Electrical Engineering Fax: 61-2-351 3847
- University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au
- Archive-name: comp-speech-faq/part2
- Last-modified: 1995/01/19
-
-
- COMP.SPEECH FAQ POSTING - PART 2/3
-
-
-
-
-
- ===========================================================================
-
-
- FAQ SECTION 2 - Signal Processing for Speech
-
- Q2.1: WHAT SAMPLING DO I NEED FOR SPEECH?
-
- For recorded speech to be understood by humans you need an 8kHz
- sampling rate or more and at least 8 bit sampling. This produces poor
- quality speech, but it can be understood.
-
- Improvements can be achieved by increasing the number of bits in
- sampling to 12bits or 16bits, or by using a non-linear encoding
- technique such as mu-law or A-law (see Q2.7). This improves the
- "signal-to-noise" ratio.
-
- Increasing the sampling rate above 8kHz, say to 10kHz, 16kHz or 20kHz,
- improves the frequency response: the higher the sampling frequency the
- better the high frequency content will be. A 16kHz sampling rate is a
- reasonable target for high quality speech recording and playback.
-
- When doing speech recognition you need to remember that your
- computer is not as good as your ear so it will have trouble with poor
- quality sounds. The choice of an appropriate sampling setup depends
- very much on the speech recognition task and the amount of computer
- power available.
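-
- As a rough illustration of the numbers above (a sketch only; the
- function names are made up for this example): the uncompressed data
- rate is sample rate x bits per sample x channels, the highest
- frequency that can be represented is half the sampling rate, and an
- ideal N-bit linear quantiser gives roughly 6.02N + 1.76 dB of
- signal-to-noise ratio for a full-scale sine wave.
-
```c
#include <assert.h>

/* Uncompressed PCM data rate in bits per second. */
long pcm_bit_rate(long sample_rate_hz, int bits_per_sample, int channels)
{
    assert(sample_rate_hz > 0 && bits_per_sample > 0 && channels > 0);
    return sample_rate_hz * bits_per_sample * channels;
}

/* Highest frequency (Hz) that a given sampling rate can represent. */
double nyquist_hz(long sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return sample_rate_hz / 2.0;
}

/*
** Textbook signal-to-noise ratio (dB) of an ideal linear quantiser
** driven by a full-scale sine wave.  Real converters do worse.
*/
double ideal_linear_snr_db(int bits_per_sample)
{
    assert(bits_per_sample > 0);
    return 6.02 * bits_per_sample + 1.76;
}
```
-
- For example, 8kHz 8-bit mono speech needs 64000 bits per second and
- can carry frequencies up to 4kHz; 16kHz 16-bit mono speech needs
- 256000 bits per second.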
- _________________________________________________________________
-
- Q2.2: HOW DO I FIND THE PITCH OF A SPEECH SIGNAL?
-
- This topic comes up regularly in the comp.dsp newsgroup. Question 2.5
- of the FAQ posting for comp.dsp gives a comprehensive list of
- references on the definition, perception and processing of pitch.
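-
- One widely used starting point, which is not taken from the comp.dsp
- FAQ, is to find the delay at which a frame of voiced speech best
- matches a delayed copy of itself. The following is a minimal sketch
- of that idea; the function name, the search-range parameters and the
- normalisation are all choices made for this example, and a practical
- pitch tracker needs voicing decisions and smoothing on top of it.
-
```c
#include <assert.h>
#include <stddef.h>

/*
** Estimate the fundamental frequency of one frame by finding the lag,
** between sample_rate/max_hz and sample_rate/min_hz, with the largest
** normalised correlation.  Returns 0.0 if no estimate can be made.
*/
double estimate_pitch_hz(const double *x, size_t n, double sample_rate_hz,
                         double min_hz, double max_hz)
{
    size_t min_lag, max_lag, lag, i, best_lag = 0;
    double best = 0.0;

    assert(x != NULL && sample_rate_hz > 0.0);
    assert(min_hz > 0.0 && max_hz > min_hz);
    if (n < 2) return 0.0;

    min_lag = (size_t)(sample_rate_hz / max_hz);
    max_lag = (size_t)(sample_rate_hz / min_hz);
    if (min_lag < 1) min_lag = 1;
    if (max_lag > n - 1) max_lag = n - 1;

    for (lag = min_lag; lag <= max_lag; lag++) {
        double r = 0.0, e0 = 0.0, e1 = 0.0, score;

        for (i = 0; i + lag < n; i++) {
            r  += x[i] * x[i + lag];          /* cross term        */
            e0 += x[i] * x[i];                /* energy, undelayed */
            e1 += x[i + lag] * x[i + lag];    /* energy, delayed   */
        }
        if (e0 <= 0.0 || e1 <= 0.0) continue; /* silent frame      */

        score = r * r / (e0 * e1);  /* squared normalised correlation */
        if (r > 0.0 && score > best) {
            best = score;
            best_lag = lag;
        }
    }
    return best_lag ? sample_rate_hz / (double)best_lag : 0.0;
}
```
-
- Keep the search range narrower than one octave if possible: a signal
- with period P also matches itself at 2P, 3P and so on, so a wide
- range invites octave errors.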
- _________________________________________________________________
-
- Q2.3: HOW DO I FIND THE START AND END POINTS OF A SPEECH SIGNAL?
-
- A large number of papers have been presented on this task. Try the
- following papers:
- * Rabiner LR, Sambur MR, "An Algorithm for Determining the Endpoints
- of Isolated Utterances", Bell System Technical Journal, Vol 54,
- No. 2, pp 297-315, 1975.
- * Drago, P.G. et al. "Digital Dynamic Speech Detectors." IEEE Trans
- on Communications, Vol 26, No 1, Jan 78, pp. 140-145.
- * Newman, W.C. "Detecting Speech with an Adaptive Neural Network."
- Electronic Design. 22 March 1990.
- * Taboada, J. et al. "Explicit Estimation of Speech Boundaries." IEE
- Proc. Sci. Meas. Technol., Vol 141, No.3, May 1994 pp153-159.
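-
- Many endpoint detectors start from short-time energy (the Rabiner
- and Sambur algorithm combines it with zero-crossing rate). The
- sketch below shows only the energy part with a fixed threshold; it
- is not an implementation of any of the papers above, and the
- function name, frame length and threshold are assumptions chosen
- for illustration.
-
```c
#include <assert.h>
#include <stddef.h>

/*
** Mark as speech every frame whose mean squared amplitude exceeds
** `threshold`.  On success returns 1 and sets *start to the first
** sample of the first speech frame and *end to one past the last
** sample of the last speech frame.  Returns 0 if no frame qualifies.
*/
int find_endpoints(const short *x, size_t n, size_t frame_len,
                   double threshold, size_t *start, size_t *end)
{
    size_t frames, f, i;
    int found = 0;

    assert(x != NULL && start != NULL && end != NULL);
    assert(frame_len > 0);

    frames = n / frame_len;          /* a trailing partial frame is ignored */
    for (f = 0; f < frames; f++) {
        double energy = 0.0;

        for (i = 0; i < frame_len; i++) {
            double s = (double)x[f * frame_len + i];
            energy += s * s;
        }
        energy /= (double)frame_len;

        if (energy > threshold) {
            if (!found) {
                *start = f * frame_len;
                found = 1;
            }
            *end = (f + 1) * frame_len;
        }
    }
    return found;
}
```
-
- In practice the threshold is usually derived from the background
- noise level measured at the start of the recording rather than
- fixed in advance.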
-
- _________________________________________________________________
-
- Q2.4: WHERE CAN I FIND FFT SOFTWARE?
-
- Try the following file available by anonymous ftp. It contains a
- series of optimised fft routines, including mixed-radix algorithms.
- The .gz suffix indicates GNU zip format.
- * ftp://usc.edu/pub/C-numanal/fft-stuff.tar.gz
-
- _________________________________________________________________
-
- Q2.5: WHAT SIGNAL PROCESSING TECHNIQUES ARE USED IN SPEECH TECHNOLOGY?
-
- This question is far too big to be answered in a FAQ posting.
- Fortunately there are many good books which answer the question. Some
- good introductory books include
- * Digital processing of speech signals; L. R. Rabiner, R. W.
- Schafer. Englewood Cliffs; London: Prentice-Hall, 1978
- * Voice and Speech Processing; T. W. Parsons. New York; McGraw Hill
- 1986
- * Computer Speech Processing; ed Frank Fallside, William A. Woods
- Englewood Cliffs: Prentice-Hall, c1985
- * Digital speech processing : speech coding, synthesis, and
- recognition edited by A. Nejat Ince; Kluwer Academic Publishers,
- Boston, c1992
- * Speech science and technology; edited by Shuzo Saito pub. Ohmsha,
- Tokyo, c1992
- * Speech analysis; edited by Ronald W. Schafer, John D. Markel New
- York, IEEE Press, c1979
- * Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal
- Processing, 1987.
- * Discrete-time processing of speech signals; John R Deller, John G
- Proakis, John H L Hansen; Macmillan 1993.
- * Signal processing of speech; F J Owens; Macmillan 1993.
-
- _________________________________________________________________
-
- Q2.6: WHAT SPEECH SAMPLING AND SIGNAL PROCESSING HARDWARE CAN I USE?
-
- In addition to the following information, have a look at the Audio
- File format document prepared by Guido van Rossum (see details in
- Section 1.8).
-
- Can anyone provide information on Mac, SGI, NeXT and other hardware?
-
- Sun standard audio port: SPARC I & II
- * Input and Output: 1 channel, 8 bit mu-law encoded, 8kHz sample
- rate. This provides telephone quality sampling.
-
- Sun standard audio port (SPARC 10 & 20)
- * Input and Output: Stereo (2 channels). 16-bit linear sampling.
- Multiple sample rates (48000, 44100, 37800, 32000, 22050, 18900,
- 16000, 11025, 9600, 8000 Hz)
-
- Macintosh Audio Hardware - an overview
- * Description: ALL Macintosh computers come with the ability to
- play back sounds at any sample rate (sample rate conversion is
- done in software.) Older machines have 8 bit stereo output
- (hardware runs at 22254 samples/second). The newer machines have
- 16 bit stereo hardware running at 44100 samples/second.
-
- Most of the recent Macintosh computers come with sound input
- hardware. There are probably exceptions to this, but the older and
- some of the current low-end machines have 8 bit (linear) mono
- hardware running at 22254.54 samples/second. All of the PowerPC,
- AV, and the 500 series notebook computers come with 16 bit 44kHz
- stereo sampling hardware. They can also record at 22050
- samples/second. The sound manager implements an AGC (Automatic
- Gain Control) function for the 8 bit hardware. The drivers have a
- switch to turn off the AGC.
-
- There are a number of DSP vendors that support high quality audio.
- Generally this means quieter analog sections, and more IO formats
- (AES/EBU, for example). Try DigiDesign and Spectral Innovations.
-
- The software drivers for sound are described in "Inside Macintosh:
- Sound". If you want to see some sample code check out the sources
- for the Matlab "Sound and Image Toolbox". They can be found at
- + ftp://ftp.apple.com/pub/malcolm/SoundAndImageToolbox.cpt.hqx
-
- Routines that play and record sounds using the toolbox are
- included (and interfaced to Matlab).
-
- Ariel Signal Processors
- * Platform: Various
- * Description: A range of signal I/O, A/D, D/A and DSP products
- are available. There are too many to list.
- * Contact:
- Ariel Corp.
- 433 River Road, Highland Park, NJ 08904.
- Ph: 908-249-2900 Fax: 908-249-2123 DSP BBS: 908-249-2124
-
- IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
- * Description: The card supports PCM, Mu-Law, A-Law and ADPCM at
- 44.1kHz (& 22.05, 11.025, 8kHz) with 16-bits of resolution in
- stereo. The card has a built-in DSP (don't know which one). The
- device also supports various formats for the output data, like
- big-endian, twos complement, etc. Good noise immunity.
-
- The card is used for IBM's VoiceServer (they use the DSP for
- speech recognition). Apparently, the IBM voiceserver has a
- speaker-independent vocabulary of over 20,000 words and each ACPA
- can support two independent sessions at once.
- * Cost: $US495
- * Contact: ?
-
- Sound Galaxy NX , Aztech Systems
- * Platform: PC - DOS,Windows 3.1
- * Cost: ?
- * Input: 8bit linear, 4-22 kHz.
- * Output: 8bit linear, 4-44.1 kHz
- * Misc: 11-voice FM Music Synthesizer YM3812; Built-in power
- amplifier; DSP signal processing support - ST70019SB Hardware
- ADPCM decompression (2:1,3:1,4:1) "AdLib" and "Sound Blaster"
- compatibility. Software includes a simple Text-to-Speech program
- "Monologue".
-
- Sound Galaxy NX PRO, Aztech Systems
- * Platform: PC - DOS,Windows 3.1
- * Cost: ?
- * Input: 2 * 8bit linear, 4-22.05 kHz(stereo), 4-44.1 KHz(mono).
- * Output: 2 * 8bit linear, 4-44.1 kHz(stereo/mono)
- * Misc: 20-voice FM Music Synthesizer; Built-in power amplifier;
- Stereo Digital/Analog Mixer; Configuration in EEPROM. Hardware
- ADPCM decompression (2:1,3:1,4:1). Includes DSP signal processing
- support. "AdLib" and "Sound Blaster Pro II" compatibility.
- Software includes a simple Text-to-Speech program "Monologue" and
- Sampling laboratory for Windows 3.1: WinDAT.
- * Contact: USA (510)6238988
-
- ATI Stereo F/X Sound Board
- * Platform: PC XT or AT - DOS, Windows 3.0, 3.1
- * Cost: $120 Canadian
- * Description: Input - 8 bit ADC, 44.1 kHz mono, 22.05 kHz Stereo.
- Output - Dynamic range = 48 dB, 32 anti-aliasing filters. Adds
- Stereo effect to existing mono Adlib or Sound Blaster apps.
- 11-voice YAMAHA FM Music Synthesizer. Built-in 8 watt power
- amplifier, 4 watts per channel. Volume ctrl on rear. 2 Joystick
- input, software setup (no switches), software included. "AdLib"
- and "Sound Blaster" compatibility. DMA support for high speed
- digital audio. ADPCM decomp @ 4:1, 3:1, 2:1. Will play .WAV files.
- Optional MIDI I/O port $79. (MIDI IN, OUT, THRU, and sequencer).
- * Contact:
- ATI Technologies Inc.
- 3761 Victoria Park Avenue, Scarborough, Ontario
- CANADA, M1W 3S2
- Ph: (416) 756-0711 Fax: (416) 756-0720
- BBS: (416) 764-9404 (9600 baud N.8.1)
-
- Other PC Sound Cards
- ==============================================================================
- sound          stereo/mono              compatible    included      voices
- card           & sample rate            with          ports
- ==============================================================================
- Adlib Gold     stereo: 8-bit 44.1kHz    Adlib ?       audio in/out, 20 (opl3)
- 1000                   16-bit 44.1kHz                 mic in,       +2 digital
-                mono: 8-bit 44.1kHz                    joystick,     channels
-                      16-bit 44.1kHz                   MIDI
-
- Sound Blaster  mono: 8-bit 22.1kHz      Adlib         audio in/out, 11 synth.
-                FM synth with                          joystick
-                2 operators
-
- Sound Blaster  stereo: 8-bit 22.05kHz   Adlib         audio in/out, 22
- Pro Basic      mono: 8-bit 44.1kHz      Sound Blaster joystick
-
- Sound Blaster  stereo: 8-bit 22.05kHz   Adlib         audio in/out, 11
- Pro            mono: 8-bit 44.1kHz      Sound Blaster joystick,
-                                                       MIDI, SCSI
-
- Sound Blaster  stereo: 8-bit 4-44.1kHz  Sound Blaster audio in/out, 20
- 16 ASP         stereo: 16-bit 4-44.1kHz               joystick,
-                                                       MIDI
-
- Audio Port     mono: 8-bit 22.05kHz     Adlib         audio in/out, 11
-                                         Sound Blaster joystick
-
- Pro Audio      stereo: 8-bit 44.1kHz    Adlib         audio in/out, 20
- Spectrum +                              Pro Audio     joystick
-                                         Spectrum
-
- Pro Audio      stereo: 16-bit 44.1kHz   Adlib         audio in/out, 20
- Spectrum 16                             Pro Audio     joystick,
-                                         Spectrum      MIDI, SCSI
-                                         Sound Blaster
-
- Thunder Board  stereo: 8-bit 22kHz      Adlib         audio in/out, 11
-                                         Sound Blaster joystick
-
- Gravis         stereo: 8-bit 44.1kHz    Adlib,        audio line    32 sampled
- Ultrasound     mono: 8-bit 44.1kHz      Sound Blaster in/out,       32 synth.
-                (w/16-bit daughtercard)                amplified out,
-                stereo: 16-bit 44.1kHz                 mic in, CD
-                mono: 16-bit 44.1kHz                   audio in,
-                                                       daughterboard
-                                                       ports (for
-                                                       SCSI and
-                                                       16-bit)
-
- MultiSound     stereo: 16-bit 44.1kHz   Nothing       audio in/out, 32 sampled
-                64x oversampling                       joystick,
-                                                       MIDI
-
- ==============================================================================
-
- _________________________________________________________________
-
- Q2.7: HOW DO I CONVERT TO/FROM MU-LAW FORMAT?
-
- Mu-law coding is a form of compression for audio signals including
- speech. It is widely used in the telecommunications field because it
- improves the signal-to-noise ratio without increasing the amount of
- data. Typically, mu-law compressed speech is carried in 8-bit samples.
- It is a companding technique, which means that it carries more
- information about the smaller signals than about larger signals.
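-
- The companding can be seen in how the 8-bit code is laid out: one
- sign bit, three exponent (segment) bits and four mantissa bits, as
- used by the sample conversion code in this answer. Each segment
- doubles the quantisation step of the one before it, so small
- signals are coded finely and large ones coarsely. The helper names
- below are invented for this sketch; the values follow from the
- BIAS of 0x84 and the "mantissa << (exponent + 3)" term in that code.
-
```c
#include <assert.h>

#define ULAW_BIAS 0x84  /* same bias as the sample conversion code */

/* Quantisation step, in 16-bit linear units, inside one segment. */
int ulaw_segment_step(int exponent)
{
    assert(exponent >= 0 && exponent <= 7);
    return 8 << exponent;             /* 8, 16, 32, ... 1024 */
}

/* Smallest decoded magnitude that belongs to a segment. */
int ulaw_segment_base(int exponent)
{
    assert(exponent >= 0 && exponent <= 7);
    return (ULAW_BIAS << exponent) - ULAW_BIAS;
}
```
-
- The quietest segment therefore resolves steps of 8 (out of a
- 16-bit range), while the loudest resolves only steps of 1024.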
-
- On SUN Sparc systems, have a look in the directory /usr/demo/SOUND.
- Included are table lookup macros for ulaw conversions. [Note however
- that not all systems will have /usr/demo/SOUND installed as it is
- optional - see your system admin if it is missing.]
-
- OR, here is some sample conversion code in C.
- /**
- ** Signal conversion routines for use with Sun4/60 audio chip
- **/
-
- #include <stdio.h>
-
- unsigned char linear2ulaw(/* int */);
- int ulaw2linear(/* unsigned char */);
-
- /*
- ** This routine converts from linear to ulaw
- **
- ** Craig Reese: IDA/Supercomputing Research Center
- ** Joe Campbell: Department of Defense
- ** 29 September 1989
- **
- ** References:
- ** 1) CCITT Recommendation G.711 (very difficult to follow)
- ** 2) "A New Digital Technique for Implementation of Any
- ** Continuous PCM Companding Law," Villeret, Michel,
- ** et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
- ** 1973, pg. 11.12-11.17
- ** 3) MIL-STD-188-113,"Interoperability and Performance Standards
- ** for Analog-to_Digital Conversion Techniques,"
- ** 17 February 1987
- **
- ** Input: Signed 16 bit linear sample
- ** Output: 8 bit ulaw sample
- */
-
- #define ZEROTRAP /* turn on the trap as per the MIL-STD */
- #define BIAS 0x84 /* define the add-in bias for 16 bit samples */
- #define CLIP 32635
-
- unsigned char
- linear2ulaw(sample)
- int sample; {
- static int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
- 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
- 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
- 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
- 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
- int sign, exponent, mantissa;
- unsigned char ulawbyte;
-
- /* Get the sample into sign-magnitude. */
- sign = (sample >> 8) & 0x80; /* set aside the sign */
- if (sign != 0) sample = -sample; /* get magnitude */
- if (sample > CLIP) sample = CLIP; /* clip the magnitude */
-
- /* Convert from 16 bit linear to ulaw. */
- sample = sample + BIAS;
- exponent = exp_lut[(sample >> 7) & 0xFF];
- mantissa = (sample >> (exponent + 3)) & 0x0F;
- ulawbyte = ~(sign | (exponent << 4) | mantissa);
- #ifdef ZEROTRAP
- if (ulawbyte == 0) ulawbyte = 0x02; /* optional CCITT trap */
- #endif
-
- return(ulawbyte);
- }
-
- /*
- ** This routine converts from ulaw to 16 bit linear.
- **
- ** Craig Reese: IDA/Supercomputing Research Center
- ** 29 September 1989
- **
- ** References:
- ** 1) CCITT Recommendation G.711 (very difficult to follow)
- ** 2) MIL-STD-188-113,"Interoperability and Performance Standards
- ** for Analog-to_Digital Conversion Techniques,"
- ** 17 February 1987
- **
- ** Input: 8 bit ulaw sample
- ** Output: signed 16 bit linear sample
- */
-
- int
- ulaw2linear(ulawbyte)
- unsigned char ulawbyte;
- {
- static int exp_lut[8] = {0,132,396,924,1980,4092,8316,16764};
- int sign, exponent, mantissa, sample;
-
- ulawbyte = ~ulawbyte;
- sign = (ulawbyte & 0x80);
- exponent = (ulawbyte >> 4) & 0x07;
- mantissa = ulawbyte & 0x0F;
- sample = exp_lut[exponent] + (mantissa << (exponent + 3));
- if (sign != 0) sample = -sample;
-
- return(sample);
- }
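-
- Because a mu-law sample can take only 256 values, decoding is often
- done by table lookup instead of arithmetic. The following is a
- sketch of that approach, not the Sun macros mentioned above; the
- names are hypothetical, and the decoding arithmetic is restated
- from ulaw2linear so that the fragment compiles on its own.
-
```c
#include <assert.h>

static short ulaw_table[256];
static int ulaw_table_ready = 0;

/* Decode one mu-law byte arithmetically (same result as ulaw2linear). */
static int ulaw_decode_slow(unsigned char u)
{
    int exponent, mantissa, magnitude;

    u = (unsigned char)~u;            /* mu-law bytes are stored inverted */
    exponent = (u >> 4) & 0x07;
    mantissa = u & 0x0F;
    magnitude = ((0x84 << exponent) - 0x84) + (mantissa << (exponent + 3));
    return (u & 0x80) ? -magnitude : magnitude;
}

/* Fill the 256-entry decode table once. */
void ulaw_table_init(void)
{
    int i;

    for (i = 0; i < 256; i++)
        ulaw_table[i] = (short)ulaw_decode_slow((unsigned char)i);
    ulaw_table_ready = 1;
}

/* Decode by table lookup, building the table on first use. */
int ulaw_lookup(unsigned char u)
{
    if (!ulaw_table_ready)
        ulaw_table_init();
    assert(ulaw_table_ready);
    return ulaw_table[u];
}
```
-
- The largest decoded magnitude is 32124, so the table entries fit in
- a 16-bit short.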
-
- _________________________________________________________________
-
-
- ===========================================================================
-
-
- FAQ SECTION 3 - Speech Coding and Compression
-
- Q3.1: SPEECH COMPRESSION TECHNIQUES.
-
- Can anyone provide a 1-2 page summary on speech compression?
-
- Note: the FAQ for comp.compression includes a few questions and
- answers on the compression of speech.
- _________________________________________________________________
-
- Q3.2: WHAT ARE SOME GOOD REFERENCES/BOOKS ON CODING/COMPRESSION?
- * Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal
- Processing, 1987.
- * Bishnu Atal in ed. Fallside, F. and W. Woods, ed. Computer Speech
- Processing. London: Prentice/Hall International, 1985.
- * Makhoul, J. "Linear Prediction: A Tutorial Review." Proc. of the
- IEEE 63 (1975): 561 - 580.
-
- _________________________________________________________________
-
- Q3.3: WHAT SPEECH COMPRESSION/CODING SOFTWARE IS AVAILABLE?
-
- Note: there are two types of speech compression technique referred to
- below. Lossless techniques preserve the speech exactly through a
- compression-decompression cycle. Lossy techniques do not preserve the
- speech perfectly. As a general rule, the more you compress speech, the
- more the quality degrades.
-
- File format conversion
- * Platform: SUN OS?
- * Description: Conversion utility able to encode and decode
- between the following formats: G.723, G.721, A-law, u-law and
- linear.
- * Availability: By anonymous ftp from
- + ftp://ftp.cwi.nl/pub/audio/ccitt-adpcm.tar.Z
-
- shorten - a lossless compressor for speech signals
- * Platform: UNIX/DOS
- * Description: A fast waveform coder suitable for speech and
- music signals in a wide variety of file formats. The degree of
- compression is adjustable from lossless to three bits a sample.
- 16bit 16kHz speech generally attains 50% lossless compression and
- 16:3 compression of CDROM quality speech is obtainable with only
- minor audible degradation.
- * Availability: Anonymous ftp - UNIX and DOS versions are in
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/shorten-1.14.tar.Z
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/shn114.zip
-
- 32 kbps ADPCM
- * Platform: SGI and Sun Sparcs
- * Description: 32 kbps ADPCM C-source code (G.721 compatibility is
- uncertain)
- * Contact: Jack Jansen
- * Availability: Anonymous ftp
- + ftp://ftp.cwi.nl/pub/adpcm.shar
-
- GSM 06.10 Compression
- * Platform: Unix; faster than real time on most Sun SPARCstations
- * Description: GSM 06.10 is a standardized lossy speech
- compression employed by most European wireless telephones. It uses
- RPE/LTP (regular pulse excitation/long term prediction) coding to
- compress frames of 160 13-bit samples (8 kHz sampling rate, i.e. a
- frame rate of 50 Hz) into 260 bits.
- * Contact: GSM 06.10 support and implementation
- jutta@cs.tu-berlin.de, cabo@cs.tu-berlin.de
- * Availability: The following configurations are available by
- anonymous ftp:
- + gzip compression from Germany:
- ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/gsm-1.0.5.tar.gz
- + MS-DOS compression from Germany:
- ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/gsm-105.zip
- + MS-DOS compression from USA:
- ftp://ftp.mv.com/pub/ddj/1194.12/gsm-105.zip
- * Misc: The WWW site is
- + http://www.cs.tu-berlin.de/~jutta/toast.html
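-
- The figures in the GSM 06.10 description can be checked with a
- little arithmetic: 160 samples at 8 kHz make a 20 ms frame, so there
- are 50 frames per second, and 260 bits per frame gives 13000 bits
- per second. The helper functions below are hypothetical and only
- restate that calculation.
-
```c
#include <assert.h>

/* Frames per second for a coder that works on fixed-size frames. */
long frames_per_second(long sample_rate_hz, int samples_per_frame)
{
    assert(sample_rate_hz > 0 && samples_per_frame > 0);
    assert(sample_rate_hz % samples_per_frame == 0);
    return sample_rate_hz / samples_per_frame;
}

/* Coded bit rate in bits per second. */
long frame_coder_bit_rate(long sample_rate_hz, int samples_per_frame,
                          int bits_per_frame)
{
    assert(bits_per_frame > 0);
    return frames_per_second(sample_rate_hz, samples_per_frame)
           * bits_per_frame;
}

/* Ratio of uncoded frame size to coded frame size. */
double frame_compression_ratio(int samples_per_frame, int bits_per_sample,
                               int bits_per_frame)
{
    assert(samples_per_frame > 0 && bits_per_sample > 0);
    assert(bits_per_frame > 0);
    return (double)samples_per_frame * bits_per_sample / bits_per_frame;
}
```
-
- Against the 13-bit input samples this is an 8:1 reduction
- (2080 bits in, 260 bits out per frame).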
-
- G.711/721/723 Compression
- * Description:
- + G.711 : CCITT u-law and A-law compression
- + G.721 : CCITT 32 kbps ADPCM coder
- + G.723 : CCITT 24 kbps and 40 kbps ADPCM coders
- * Availability: By email to teledoc@itu.arcom.ch, with
- GET ITU-3022
- as the *only* line in the body of the message. This is also available
- by anonymous ftp from:
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/G711_G721_G723.tar.Z
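-
- Assuming the usual 8 kHz telephone sampling rate (an assumption;
- the entry above quotes only bit rates), the rates listed correspond
- to 8 bits per sample for G.711, 4 for G.721, and 3 or 5 for G.723.
- The sketch below, with made-up function names, shows the sum and
- how much storage a recording needs at a given rate.
-
```c
#include <assert.h>

/* Bits spent on each sample by a fixed-rate coder. */
int coded_bits_per_sample(long bit_rate_bps, long sample_rate_hz)
{
    assert(bit_rate_bps > 0 && sample_rate_hz > 0);
    assert(bit_rate_bps % sample_rate_hz == 0);
    return (int)(bit_rate_bps / sample_rate_hz);
}

/* Bytes needed to store `seconds` of coded speech, rounded up. */
long storage_bytes(long bit_rate_bps, long seconds)
{
    assert(bit_rate_bps > 0 && seconds >= 0);
    return (bit_rate_bps * seconds + 7) / 8;
}
```
-
- One minute of speech takes 480000 bytes as G.711 and 240000 bytes
- as 32 kbps ADPCM.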
-
- G.728 Compression
- * Description: G.728 low-delay CELP package written by Alex
- Zatsman of Analog Devices, Inc.
- * Availability: By anonymous ftp from
- + ftp://dspsun.eas.asu.edu/pub/speech/ldcelp.tgz
-
- G.728 LD-CELP vocoder
- * Platform: Analog Devices ADSP-2171
- * Description: Real-time, full-duplex G.728 LD-CELP vocoder that
- runs on a single Analog Devices ADSP-2171. Source and object code
- available for a one-time license fee.
- * Contact:
- Cole Erskine
- Analogical Systems
- 299 California Avenue, Suite 120
- Palo Alto, CA 94306, USA
- Tel:(415) 323-3232 FAX:(415) 323-4222
- Internet: cole@analogical.com
-
- U.S.F.S. 1016 CELP vocoder for DSP56001
- * Platform: DSP56001
- * Description: Real-time U.S.F.S. 1016 CELP vocoder that runs on a
- single 27MHz Motorola DSP56001. Free demo software available for
- PC-56 and PC-56D. Source and object code available for a one-time
- license fee.
- * Contact:
- Cole Erskine
- Analogical Systems
- 299 California Avenue, Suite 120
- Palo Alto, CA 94306, USA
- Tel:(415) 323-3232 FAX:(415) 323-4222
- Email: cole@analogical.com
-
- 8 Kbit/s CELP on the TMS320C5x family of DSP chips
- * Description: For low bandwidth transmission of voice, compact
- voice storage for archival purposes, low-cost digital answering
- machines and efficient storage for voice mail. Features :
- + near toll quality at 8 Kb/s.
- + Variable rate option with 1 Kb/s silence encoding.
- + Implemented on a fixed-point processor for lower system cost.
- + Attractive licensing scheme.
- + Future availability of 4 Kb/s.
- + Custom rates possible.
- Capacity :
- + Two half-duplex or one full duplex channels on the 20 MIPS
- 'C5x (at 95% and 55% CPU utilization respectively).
- + Two full duplex channels on the 28.6 MIPS 'C5x (at 77% CPU
- utilization).
- + Requires 9 K-words program memory and 3 K-words data memory.
- + Decoding in real-time on a 486 class CPU.
- * Contact:
- CVI Inc.
- 443 Vienna Cres. North Vancouver, BC, Canada V7N 3B3
- Tel: (604) 987 1719 Fax: (604) 986 8139
- Email: cvi@extropia.wimsey.com
-
- CELP 3.2a & LPC
- * Platform: Sun (the makefiles & source can be modified for other
- platforms)
- * Description: CELP is a lossy compression technique. This package
- contains Fortran and C simulation source code for version 3.2a of
- the U.S. DoD's Federal-Standard-1016 based 4800 bps code excited
- linear prediction voice coder (CELP 3.2a). It is available for
- worldwide distribution (on DOS diskettes, but configured to compile
- on Sun SPARCstations) from NTIS and DTIC. Example input and
- processed speech files are
- included. A Technical Information Bulletin (TIB), "Details to
- Assist in Implementation of Federal Standard 1016 CELP," and the
- official standard, "Federal Standard 1016, Telecommunications:
- Analog to Digital Conversion of Radio Voice by 4,800 bit/second
- Code Excited Linear Prediction (CELP)," are also available.
- * Availability 1: Through the National Technical Information
- Service:
- NTIS
- U.S. Department of Commerce
- 5285 Port Royal Road, Springfield, VA 22161, USA
-
- The "AD" ordering number for the CELP software is AD M000 118 (US$
- 90.00) and for the TIB it's AD A256 629 (US$ 17.50). The LPC-10
- standard, described below, is FIPS Pub 137 (US$ 12.50). There is a
- $3.00 shipping charge on all U.S. orders. The telephone number for
- their automated system is 703-487-4650, or 703-487-4600 if you'd
- prefer to talk with a real person.
-
- (U.S. DoD personnel and contractors can receive the package from
- the Defense Technical Information Center: DTIC, Building 5,
- Cameron Station, Alexandria, VA 22304-6145. Their telephone number
- is 703-274-7633.)
- * Availability 2: By anonymous ftp from:
- + ftp://ftp.super.org(192.31.192.1)/pub/celp_3.2a.tar.Z
- + OR
- ftp://svr-ftp.eng.cam.ac.uk/comp.speech/sources/celp_3.2a.tar.Z
- * Misc: The following articles describe the Federal-Standard-1016
- 4.8-kbps CELP coder (it's unnecessary to read more than one):
- + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C.
- Welch, "The Federal Standard 1016 4800 bps CELP Voice Coder,"
- Digital Signal Processing, Academic Press, 1991, Vol. 1, No.
- 3, p. 145-155.
- + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C.
- Welch, "The DoD 4.8 kbps Standard (Proposed Federal Standard
- 1016)," in Advances in Speech Coding, ed. Atal, Cuperman and
- Gersho, Kluwer Academic Publishers, 1991, Chapter 12, p.
- 121-133.
- + Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C.
- Welch, "The Proposed Federal Standard 1016 4800 bps Voice
- Coder: CELP," Speech Technology Magazine, April/May 1990, p.
- 58-64.
-
- The U.S. DoD's Federal-Standard-1015/NATO-STANAG-4198 based 2400
- bps linear prediction coder (LPC-10) was republished as a Federal
- Information Processing Standards Publication 137 (FIPS Pub 137).
- It is described in:
- + Thomas E. Tremain, "The Government Standard Linear Predictive
- Coding Algorithm: LPC-10," Speech Technology Magazine, April
- 1982, p. 40-49.
-
- There is also a section about FS-1015 in the book:
- + Panos E. Papamichalis, Practical Approaches to Speech Coding,
- Prentice-Hall, 1987.
-
- The voicing classifier used in the enhanced LPC-10 (LPC-10e) is
- described in:
- + Campbell, Joseph P., Jr. and T. E. Tremain, "Voiced/ Unvoiced
- Classification of Speech with Applications to the U.S.
- Government LPC-10E Algorithm," Proceedings of the IEEE
- International Conf. on Acoustics, Speech, and Signal
- Processing, 1986, p. 473-6.
- Copies of the official standard, "Federal Standard 1016, Tele-
- communications: Analog to Digital Conversion of Radio Voice by
- 4,800 bit/second Code Excited Linear Prediction (CELP)" are
- available for US$ 5.00 each from:
- GSA Federal Supply Service Bureau
- Specification Section, Suite 8100
- 470 E. L'Enfant Place, S.W.
- Washington, DC 20407
- (202)755-0325
- Realtime DSP code for FS-1015 and FS-1016 is sold by:
- John DellaMorte, DSP Software Engineering
- 165 Middlesex Tpk, Suite 206, Bedford, MA 01730, USA
- Ph: 1-617-275-3733 Fax: 1-617-275-4323
- dspse.bedford@channel1.com
- DSP Software Engineering's FS-1016 code can run on a DSP Research's
- Tiger 30 (a PC board with a TMS320C3x and analog interface suited
- to development work).
- DSP Research
- 1095 E. Duane Ave, Sunnyvale, CA 94086, USA
- Ph: (408)773-1042 Fax: (408)736-3451
-
- _________________________________________________________________
-
-
- ===========================================================================
-
-
- FAQ SECTION 4 - Natural Language Processing
-
- There is now a newsgroup specifically for Natural Language Processing.
- It is called comp.ai.nat-lang.
-
- There is also a lot of useful information on Natural Language
- Processing in the FAQ for comp.ai. That FAQ lists available software
- and useful references. It includes a substantial list of software,
- documentation and other info available by ftp.
- _________________________________________________________________
-
- Q4.1: WHAT ARE SOME GOOD REFERENCES/BOOKS ON NLP?
-
- Take a look at the FAQ for the "comp.ai" newsgroup as it also includes
- some useful references.
- * James Allen: Natural Language Understanding, (Benjamin/Cummings
- Series in Computer Science) Menlo Park: Benjamin/Cummings
- Publishing Company, 1987.
- + This book consists of four parts: syntactic processing,
- semantic interpretation, context and world knowledge, and
- response generation.
- * G. Gazdar and C. Mellish, Natural Language Processing in Prolog,
- Addison Wesley, 1989
- * G. Gazdar and C. Mellish, Natural Language Processing in Lisp,
- Addison Wesley, 1989
- * G. Gazdar and C. Mellish, Natural Language Processing in Pop11,
- Addison Wesley, 1989
- + Emphasis on parsing, especially unification-based parsing,
- lots of details on the lexicon, feature propagation, etc.
- Fair coverage of semantic interpretation, inference in
- natural language processing, and pragmatics; much less
- extensive than in Allen's book, but more formal. There are
- three versions, one for each programming language listed
- above, with complete code.
- * Shapiro, Stuart C.: Encyclopedia of Artificial Intelligence Vol.1
- and 2. New York: John Wiley & Sons, 1990.
- + There are articles on the different areas of natural language
- processing which also give additional references.
- * Paris, Ce'cile L.; Swartout, William R.; Mann, William C.:
- Natural Language Generation in Artificial Intelligence and
- Computational Linguistics. Boston: Kluwer Academic Publishers,
- 1991.
- + The book describes current research developments in natural
- language generation and discusses all aspects of the generation
- process. It has three sections: one on text planning, one on
- lexical choice, and one on grammar.
- * Readings in Natural Language Processing, ed by B. Grosz, K.
- Sparck Jones and B. Webber, Morgan Kaufmann, 1986
- + A collection of classic papers on Natural Language
- Processing. Fairly complete at the time the book came out
- (1986) but now seriously out of date. Still useful for ATN's,
- etc.
- * Klaus K. Obermeier, Natural Language Processing Technologies in
- Artificial Intelligence: The Science and Industry Perspective,
- Ellis Horwood Ltd, John Wiley & Sons, Chichester, England, 1989.
-
- Journals
-
- The major journals of the field are
- * "Computational Linguistics" and "Cognitive Science" for the
- artificial intelligence aspects,
- * "Cognition" for the psychological aspects,
- * "Language", "Linguistics and Philosophy" and "Linguistic
- Inquiry" for the linguistic aspects.
- * "Artificial Intelligence" occasionally has papers on natural
- language processing.
-
- Conferences
-
- The major conferences of the field are
- * ACL (held every year)
- * COLING (held every two years)
-
- Most AI conferences have an NLP track; AAAI, ECAI, IJCAI and the
- Cognitive Science Society conferences usually are the most
- interesting for NLP. CUNY is an important psycholinguistic
- conference. There are lots of linguistic conferences: the most
- important seem to be NELS, the conference of the Chicago Linguistic
- Society (CLS), WCCFL, LSA, the Amsterdam Colloquium, and SALT.
-
- _________________________________________________________________
-
- Q4.2: WHAT NLP SOFTWARE IS AVAILABLE?
-
- Check the comments at the start of this section for information on
- other newsgroups and sources of information on NLP.
-
- Natural Language Software Registry (NLSR) - NLP Tools
- * The Natural Language Software Registry is available from the
- German Research Institute for Artificial Intelligence (DFKI) in
- Saarbruecken. Its purpose is to facilitate the exchange and
- evaluation of natural language processing software within the
- research community. To this end, the NLSR is cataloging natural
- language software projects, both commercial and non-commercial.
- The new updated and enlarged version contains more than 100
- descriptions of natural language processing software. Registry
- listings include:
- + speech signal processors, such as the Computerized Speech Lab
- (Kay Elemetrics)
- + morphological analyzers, such as PC-KIMMO (Summer Institute
- for Linguistics)
- + parsers, such as Alveytools (University of Edinburgh)
- + semantic and pragmatic analyzers, such as NLL (University of
- the Saarland, Germany)
- + generation programs, such as FUF (Ben Gurion University of
- the Negev)
- + knowledge representation systems, such as Rhet (University of
- Rochester)
- + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI),
- Pundit (UNISYS), SNePS (SUNY Buffalo),
- + NLP-Tools, such as GULP (University of Georgia) or Linguist
- (Kansai Research Laboratory)
- + applications programs (misc.)
- * If you have developed a piece of software for natural language
- processing that other researchers might find useful, you can
- include it by returning the questionnaire available from the
- sources below.
- * ftp://ftp.dfki.uni-sb.de/pub/registry
- * e-mail: registry@dfki.uni-sb.de
- * post:
- Natural Language Software Registry
- Deutsches Forschungsinstitut fuer Kuenstliche Intelligenz (DFKI)
- Stuhlsatzenhausweg 3
- D-66123 Saarbruecken
- Germany
- * Other ftp sites are
- + ftp://crlftp.nmsu.edu/pub/non-lexical/NL_Software_Registy
- + ftp://dri.cornell.edu/pub/Natural_Language_Software_Registry
-
- Part of Speech Tagger
- * Description: A rule-based part of speech tagger developed by
- Eric Brill. For a detailed description of the tagger see chapter 6
- of his thesis.
- * Availability: The tagger and description are available by
- anonymous ftp from
- + ftp://lightning.lcs.mit.edu/pub/BRILL/Programs & Papers
-
- _________________________________________________________________
-
-
-
-
- Andrew Hunt
- ---
- Speech Technology Research Group Ph: 61-2-351 4509
- Dept. of Electrical Engineering Fax: 61-2-351 3847
- University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au
- Archive-name: comp-speech-faq/part3
- Last-modified: 1995/01/19
-
-
- COMP.SPEECH FAQ POSTING - PART 3/3
-
-
-
-
-
- ===========================================================================
-
-
- FAQ SECTION 5 - Speech Synthesis
-
- Q5.1: WHAT IS SPEECH SYNTHESIS?
-
- Speech synthesis is the task of transforming written input to spoken
- output. The input can either be provided in a graphemic/orthographic
- or a phonemic script, depending on its source.
- _________________________________________________________________
-
- Q5.2: HOW CAN SPEECH SYNTHESIS BE PERFORMED?
-
- There are several algorithms. The choice depends on the task they're
- used for. The easiest way is to just record the voice of a person
- speaking the desired phrases. This is useful if only a restricted
- volume of phrases and sentences is used, e.g. messages in a train
- station, or schedule information via phone. The quality depends on the
- way recording is done.
-
- More sophisticated, but lower in quality, are algorithms which split
- the speech into smaller pieces. The smaller the units, the fewer of
- them are needed, but the quality also decreases. A commonly used unit
- is the phoneme, the smallest linguistic unit. Depending on the
- language, there are about 35-50 phonemes in western European
- languages, i.e. 35-50 single recordings are needed. The problem is
- combining them, as fluent speech requires fluent transitions between
- the elements. The intelligibility is therefore lower, but the memory
- required is small.
-
- A solution to this dilemma is using diphones. Instead of splitting at
- the transitions, the cut is done at the center of the phonemes,
- leaving the transitions themselves intact. This gives about 400
- elements (20*20) and the quality increases.
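-
- The diphone cut described above can be sketched as follows. Given
- phoneme labels with start and end times, each diphone runs from the
- center of one phoneme to the center of the next. The function names,
- the label format and the times (in milliseconds) are assumptions made
- for this illustration, not part of any package listed in this FAQ.

```python
def diphone_cuts(segments):
    """Turn labeled phonemes into diphone cut points.

    segments: list of (phoneme, start_ms, end_ms), contiguous in time.
    Returns a list of (diphone_name, cut_start_ms, cut_end_ms), where
    each cut spans from the midpoint of one phoneme to the midpoint of
    the next, so the transition between them stays intact.
    """
    units = []
    for (p1, s1, e1), (p2, s2, e2) in zip(segments, segments[1:]):
        units.append((p1 + "-" + p2, (s1 + e1) / 2.0, (s2 + e2) / 2.0))
    return units


def inventory_size(n_phonemes):
    """Upper bound on the diphone inventory: every ordered phoneme pair."""
    return n_phonemes * n_phonemes


if __name__ == "__main__":
    labels = [("sil", 0, 100), ("h", 100, 200), ("e", 200, 400)]
    for unit in diphone_cuts(labels):
        print(unit)
    print(inventory_size(20))
```

- inventory_size(20) gives the 400 elements quoted above; a larger
- phoneme set gives a correspondingly larger upper bound, although not
- every pair need occur in a given language.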
-
- The longer the units become, the more elements there are, but the
- quality increases along with the memory required. Other units which
- are widely used are half-syllables, syllables, words, or combinations
- of them, e.g. word stems and inflectional endings.
- _________________________________________________________________
-
- Q5.3: WHAT ARE SOME GOOD REFERENCES/BOOKS ON SYNTHESIS?
-
- The following are good introductory books/articles.
- * Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal
- Processing, 1987.
- * D. H. Klatt, "Review of Text-To-Speech Conversion for English",
- Jnl. of the Acoustic Society of America (JASA), v82, Sept. 1987,
- pp 737-793.
- * "Talking Machines, Theories, Models and Designs" Eds, G. Bailly &
- C. Benoit (Elsevier: North Holland)
- * I. H. Witten. Principles of Computer Speech. (London: Academic
- Press, Inc., 1982).
- * John Allen, Sharon Hunnicut and Dennis H. Klatt, "From Text to
- Speech: The MITalk System", Cambridge University Press, 1987.
-
- _________________________________________________________________
-
- Q5.4: WHAT SPEECH SYNTHESIS SOFTWARE/HARDWARE IS AVAILABLE?
-
- Please email any updates, corrections or additions to the following
- list. The range of commercially available synthesis software is
- growing rapidly so any help in keeping up to date will be appreciated.
-
- Orator Text-to-Speech Synthesizer
- * Platform: SUN SPARC, Decstation 5000. Written in C, and
- therefore portable to other UNIX platforms. Some successful ports:
- HP, RS-6000, PC-Unix [Linux].
- * Description: Sophisticated speech synthesis package. Has text
- preprocessing (for abbreviations, numbers), acronym rules, and
- human-like spelling routines. Natural-sounding synthesis based on
- demisyllable concatenation.
-
- Has high accuracy for pronunciation of names of people, places and
- businesses in America; good accuracy for English text; rules for
- stress and intonation marking; various methods of user control and
- customization at most stages of processing.
-
- A new version of the ORATOR system is under development. Both
- ORATOR and this new "ORATOR II" system are capable of very good
- general text synthesis. The ORATOR II system has a more
- natural-sounding voice.
- * Hardware: Runs on common SPARC or Decstation workstations, using
- their internal audio output capability. Recommend at least 16M of
- memory.
- * Availability and Pricing: Contact Bellcore's Licensing Office
- (1-800-527-1080) or email Anthony Lindsey alin1@panix.com
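-
- Text preprocessing of the kind mentioned in the description
- (abbreviations, numbers) can be sketched as below. This is a toy
- illustration only and has nothing to do with ORATOR's actual rules;
- the abbreviation table and all function names are invented for the
- example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Illustrative table; a real system needs context to pick an expansion.
ABBREVIATIONS = {"Dr.": "doctor", "Mr.": "mister", "St.": "street"}


def number_to_words(n):
    """Spell out an integer in the range 0..999999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def normalize(text):
    """Expand known abbreviations and short digit strings into words."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]{1,6}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)


if __name__ == "__main__":
    print(normalize("Dr. Smith lives at 42 Elm St."))
```

- A real preprocessor has to disambiguate (is "St." street or saint? is
- "1995" a year or a quantity?), which is where most of the work lies.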
-
- Text to phoneme program (1)
- * Platform: unknown
- * Description: Text to phoneme program. Based on Naval Research
- Lab's set of text to phoneme rules.
- * Availability: by anonymous ftp
- + ftp://shark.cse.fau.edu/pub/src/phon.tar.Z
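-
- For readers unfamiliar with rule-based text to phoneme conversion,
- here is a minimal sketch. It assumes rules of the form "left context
- [letters] right context = phonemes", tried in order at each position.
- The handful of rules and the phoneme symbols below are invented for
- illustration; they are NOT the Naval Research Lab rule set and the
- code is not taken from the program above.

```python
# Each rule: (left_context, letters, right_context, phonemes).
# An empty context matches anything; " " stands for a word boundary.
RULES = [
    (" ", "the", " ", ["DH", "AX"]),
    ("", "ch", "", ["CH"]),
    ("", "sh", "", ["SH"]),
    ("", "ee", "", ["IY"]),
    ("", "a", "", ["AE"]),
    ("", "c", "", ["K"]),
    ("", "e", "", ["EH"]),
    ("", "h", "", ["HH"]),
    ("", "p", "", ["P"]),
    ("", "s", "", ["S"]),
    ("", "t", "", ["T"]),
]


def letters_to_phonemes(word):
    """Scan left to right, applying the first rule that matches."""
    text = " " + word.lower() + " "   # pad with word boundaries
    pos = 1
    phonemes = []
    while pos < len(text) - 1:
        for left, letters, right, output in RULES:
            end = pos + len(letters)
            if (text.startswith(letters, pos)
                    and text[:pos].endswith(left)
                    and text.startswith(right, end)):
                phonemes.extend(output)
                pos = end
                break
        else:
            pos += 1   # no rule covers this letter: skip it
    return phonemes


if __name__ == "__main__":
    for word in ("the", "chat", "sheep"):
        print(word, letters_to_phonemes(word))
```

- Rule order matters: the whole-word rule for "the" and the two-letter
- rules must come before the single-letter rules that would otherwise
- match first.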
-
- Text to phoneme program (2)
- * Platform: unknown
- * Description: Text to phoneme program.
- * Availability: by anonymous ftp
- + ftp://wuarchive.wustl.edu/mirrors/unix-c/utils/phoneme.c
-
- Text to phoneme program (3)
- * Description: A public domain version of the same Naval Research
- Lab text to phoneme rules.
- * Availability: By anonymous ftp
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/english2phoneme.shar
-
- Text to speech program
- * Description: An implementation of the Klatt phoneme to waveform
- speech synthesiser.
- * Availability: By anonymous ftp
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/klatt-0.02.tar.Z
-
- "Speak" - a Text to Speech Program
- * Platform: Sun SPARC
- * Description: Text to speech program based on concatenation of
- pre-recorded speech segments. A function library can be used to
- integrate speech output into other code.
- * Hardware: SPARC audio I/O
- * Availability: by anonymous ftp
- + ftp://wilma.cs.brown.edu/pub/speak.tar.Z
-
- TheBigMouth - a Text to Speech Program
- * Platform: NeXT
- * Description: Text to speech program based on concatenation of
- pre-recorded speech segments. NeXT equivalent of "Speak" for Suns.
- * Availability: try NeXT archive sites such as
- sonata.cc.purdue.edu.
-
- TextToSpeech Kit
- * Platform: NeXT Computers
- * Description: The TextToSpeech Kit does unrestricted conversion
- of English text to synthesized speech in real-time. The user has
- control over speaking rate, median pitch, stereo balance, volume,
- and intonation type. Text of any length can be spoken, and
- messages can be queued up, from multiple applications if desired.
- Real-time controls such as pause, continue, and erase are
- included. Pronunciations are derived primarily by dictionary
- look-up. The Main Dictionary has nearly 100,000 hand-edited
- pronunciations which can be supplemented or overridden with the
- User and Application dictionaries. A number parser handles numbers
- in any form. A letter-to-sound knowledge base provides
- pronunciations for words not in the Main or customized
- dictionaries. Dictionary search order is under user control.
- Special modes of text input are available for spelling and
- emphasis of words or phrases. The actual conversion of text to
- speech is done by the TextToSpeech Server. The Server runs as an
- independent task in the background, and can handle up to 50 client
- connections.
- * Misc: The TextToSpeech Kit comes in two packages: the Developer
- Kit and the User Kit. The Developer Kit enables developers to
- build and test applications which incorporate text-to-speech. It
- includes the TextToSpeech Server, the TextToSpeech Object, the
- pronunciation editor PrEditor, several example applications,
- phonetic fonts, example source code, and developer documentation.
- The User Kit provides support for applications which incorporate
- text-to-speech. It is a subset of the Developer Kit.
- * Hardware: Uses standard NeXT Computer hardware.
- * Cost:
- + TextToSpeech User Kit: $175 CDN ($145 US)
- + TextToSpeech Developer Kit: $350 CDN ($290 US)
- + Upgrade from User to Developer Kit: $175 CDN ($145 US)
- * Availability: Trillium Sound Research
- 1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
- Tel: (403) 284-9278 Fax: (403) 282-6778
- Order Desk: 1-800-L-ORATOR (US and Canada only)
- Email: TTSInfo@trillium.ab.ca
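-
- The pronunciation look-up scheme described for the TextToSpeech Kit
- (several dictionaries searched in a user-chosen order, with a
- letter-to-sound fallback) can be sketched as below. This is not the
- Kit's API; the function names, dictionary names and phoneme symbols
- are assumptions made for the example, and the fallback is a
- placeholder that simply spells the word.

```python
def letter_to_sound(word):
    """Placeholder fallback: spell the word out letter by letter."""
    return list(word.upper())


def pronounce(word, dictionaries, search_order):
    """Return (pronunciation, source) for a word.

    dictionaries: maps a dictionary name to {word: list of phonemes}.
    search_order: dictionary names in the order they should be tried.
    """
    key = word.lower()
    for name in search_order:
        entry = dictionaries.get(name, {}).get(key)
        if entry is not None:
            return entry, name
    return letter_to_sound(key), "letter-to-sound"


if __name__ == "__main__":
    dictionaries = {
        "main": {"read": ["R", "IY", "D"]},
        "user": {"read": ["R", "EH", "D"]},
        "application": {},
    }
    print(pronounce("read", dictionaries, ["user", "application", "main"]))
    print(pronounce("read", dictionaries, ["main", "user", "application"]))
    print(pronounce("xyz", dictionaries, ["user", "application", "main"]))
```

- Putting the user dictionary first lets its entries override the main
- dictionary; reversing the order makes them supplements only.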
-
- SGI Developers Toolbox Synthesiser
- * Platform: SGI
- * Description: The SGI Developer Toolbox 4.0 CDROM contains a
- basic public domain text-to-speech program in the publics/speak
- directory. The directory includes man pages and source.
- * Availability: on the SGI Developer Toolbox 4.0 CDROM
-
- rsynth
- * Platform: Various (including Solaris2.3, SunOS4.1.3, HPUX, SGI
- Irix4.x, Linux)
- * Description: Public domain text-to-speech system assembled from a
- variety of sources. It supports CMU and "beep" format dictionaries
- and now utilises stress marks in the dictionary in synthesising
- intonation.
- * Price: Free
- * Availability: by anonymous ftp from
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/rsynth-2.0.tar.Z
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/rsynth-2.0.tar.gz
-
- SENSYN speech synthesizer
- * Platform: PC, Mac, Sun, and NeXT
- * Rough Cost: $300
- * Description: This formant synthesizer produces speech waveform
- files based on the (Klatt) KLSYN88 synthesizer. It is intended for
- laboratory and research use. Note that this is NOT a
- text-to-speech synthesizer, but creates speech sounds based upon a
- large number of input variables (formant frequencies, bandwidths,
- glottal pulse characteristics, etc.) and would be used as part of
- a TTS system. Includes full source code.
- * Availability: Sensimetrics Corporation
- 64 Sidney Street, Cambridge MA 02139.
- Fax: (617) 225-0470; Tel: (617) 225-2442.
- Email: sensimetrics@sens.com
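-
- To give a feel for what a formant synthesizer computes from inputs
- such as formant frequencies and bandwidths, here is a minimal sketch:
- an impulse train passed through a cascade of two-pole digital
- resonators of the kind used in Klatt-style synthesizers. It is not
- SENSYN or KLSYN88 code, it has no proper glottal source model, and
- the formant values and function names are rough assumptions chosen
- for illustration.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c   # gives unity gain at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate_hz):
    """Crude voicing source: one unit impulse per pitch period."""
    period = int(round(sample_rate_hz / f0_hz))
    n = int(round(duration_s * sample_rate_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.1,
                     sample_rate_hz=10000):
    """Cascade the source through one resonator per (freq, bandwidth)."""
    signal = impulse_train(f0_hz, duration_s, sample_rate_hz)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate_hz)
    return signal


if __name__ == "__main__":
    # Rough, illustrative formant values for an "ah"-like vowel.
    samples = synthesize_vowel([(700, 60), (1200, 90), (2600, 150)])
    print(len(samples), max(samples), min(samples))
```

- A research synthesizer exposes many more parameters than this, but
- the cascade of resonators is the part that shapes the spectrum into
- formant peaks.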
-
- spchsyn.exe
- * Platform: PC?
- * Availability: By anonymous ftp as a self extracting DOS archive.
- + ftp://evans.ee.adfa.oz.au/mirrors/tibbs/applications/spchsyn.exe
- * Requirements: May require special TI product(s), but all source
- is there.
-
- CSRE: Canadian Speech Research Environment
- * Platform: PC
- * Cost: Distributed on a cost recovery basis.
- * Description: CSRE is a software system which includes, in
- addition to the Klatt speech synthesizer, speech analysis and an
- experiment control system. A paper about the whole package can be
- found in:
- + Jamieson D.G. et al, "CSRE: A Speech Research Environment",
- Proc. of the Second Intl. Conf. on Spoken Language
- Processing, Edmonton: University of Alberta, pp. 1127-1130.
- * Hardware: Can use a range of data acquisition/DSP hardware.
- * Availability: For more information contact
- Krystyna Marciniak
- email march@uwovax.uwo.ca
- Tel (519) 661-3901 Fax (519) 661-3805.
- For technical information email ramji@uwovax.uwo.ca
- * Note: A more detailed description is given in Section 1.9 on
- speech environments.
-
- Eloquence (currently an alpha release)
- * Platform: Windows and Solaris
- * Description: Software based text-to-speech package. Generates
- waveforms completely algorithmically instead of by concatenating
- waveforms, for maximum flexibility and naturalism. For instance,
- when the user requests a deeper voice, the software simulates a
- larger vocal tract, instead of simply pitch-shifting samples.
-
- Uses high-level linguistic parsing, which obviates the need for a
- huge dictionary. Handles numbers, acronyms, currency, etc.
- Includes a set of annotation symbols, for placing stress on
- particular words, expressing excitement/boredom, etc. Also allows
- phonetic input. The final version, including support for Windows
- DDE and OLE and UNIX Sockets, will be released by the end of 1994.
-
- Produces male and female voices for General American English.
- Dialects under development include Alabama, Brooklyn, and Boston.
- * Price: $5000 (unconfirmed)
- * Availability:
- Eloquent Technology, Inc.
- 2389 North Triphammer Road
- Ithaca, NY 14850
- Ph: (607) 266-7025 Fax: (607) 266-7030
- Email: eti@plab.dmll.cornell.edu
-
- JSRU
- * Platform: UNIX and PC
- * Cost: 100 pounds sterling (from academic institutions and
- industry)
- * Description: A C version of the JSRU system, Version 2.3 is
- available. It's written in Turbo C but runs on most Unix systems
- with very little modification. A Form of Agreement must be signed
- to say that the software is required for research and development
- only.
- * Contact: Dr. E. Lewis (eric.lewis@bristol.ac.uk)
-
- Klatt-style synthesiser
- * Platform: Unix
- * Cost: Free
- * Description: Software posted to comp.speech in late 1992.
- * Availability: By anonymous ftp from the comp.speech archives
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources/klatt-0.02.tar.Z
-
- DECTalk
- * Description: Speech synthesis hardware and software. Detailed
- information on DECtalk and other DEC products is available on a
- World-Wide Web site.
- + http://www.digital.com/info.html
- For specific information on DECtalk, check out this www url:
- + http://www.digital.com/archive/pub/Digital/info/Customer-Update/940620005.txt
-
- Speech Manager and PlainTalk
- * Platform: Macintosh
- * Cost: Free
- * Description: Apple's new text-to-speech system extension(s) that
- enable applications (listed below) to perform text-to-speech
- conversion. The Speech Manager runs on most Macs, but PlainTalk
- (and the high quality voices) requires a 68020 Mac or better.
- * Availability: By anonymous ftp from:
- + ftp://ftp.apple.com/dts/mac/sys.soft/speech
- There are 3 files in this directory:
-
- 6273632 Aug 14 22:51 macintalk-pro.hqx
- PlainTalk Text-To-Speech 1.0 speech synthesizer extension
- (includes Female Voice, Compressed); TTS Female Voice;
- TTS Male Voice; and TTS Male Voice, Compressed. Requires
- 68020 or better!
-
- 370108 Aug 13 04:30 speech-manager-docs.hqx
- Apple DocViewer format (Inside Macintosh style, no
- installation instructions - just drag everything onto
- your closed System Folder).
-
- 262569 Aug 7 07:01 speech-manager.hqx
- Speech Manager 1.1.1 (includes Marvin's voice) and
- MacInTalk Voices 1.1.1 (9 more voices). Runs on most Macs.
-
- Various Mac Speech Output Applications
- * Platform: Macintosh
- * Cost: Free (except for At Ease)
- * Description: Some of the Speech Manager aware text-to-speech
- (TTS) applications, etc. are listed below (there are more on the
- Apple Developer CD-ROMs).
- Application, etc.   Source              Comments
- __________________  __________________  _________________________________________________
- AddressSpeech       info-mac            4D talking address book (from Speech Pack 2.0)
- At Ease 2.0         MacWarehouse        Friendly desktop that speaks file names
- At Ease 2.0 WG      MacWarehouse        Friendly desktop that speaks file names
- Eliza 3.1           AOL                 Talking Eliza (Rogerian psych therapist)
- FB speech           Inside Basic Mag,   FutureBasic demo
-                     volume 3, no. 6.
- FB Speech demo      Inside Basic Mag,   FutureBasic demo
-                     volume 3, no. 7.
- Fortune 1.1         info-mac            Like a talking UNIX fortune command - slick
- Homer 0.92d9        zaphod.ee.pitt.edu  GUI IRC client, assign nicks voices - slick
- MacMessage 1.0      FirstClassBBS       Share talking messages/customizable startup
- Say                 info-mac            MPW Tool which converts standard input to speech
- ScriptTools 1.2     info-mac            Write AppleScript scripts to say text messages
- Siege Watch 1.01f   info-mac            Wryly political speaking clock
- SoToSpeak1.0.0b10   info-mac            Two voice conversation (also see Fortune's About)
- Speak It!           info-mac            Type in a message and have it spoken
- Speaker 1.11        info-mac            Simple text file editor, speaks on CR, macros
- Speecher 1.2.1      info-mac            Customizable word pronunciation/substitution
- SpeechManagerdemo   info-mac            Command line interface, C source, aka -explorer
- Speech Pack 2.0     info-mac            4th Dimension external, add speech to database
- SpeechUnitEx        info-mac            Pascal source code for speech in Lab 7
- speek-02b           info-mac            Speech XCMD for HyperCard
- TalkingClockPro2.0  info-mac            AppleScriptable talking clock extension (2.0b0)
- TeachText 7.2       AV Mac              Apple's talking TeachText (simple editor w/QT)
- Tex-Edit 1.9        AOL                 Talking word processor, McSink like, modeming
- VoiceDemo 1.0.1     info-mac            Bare bones phrase talker
- Welcome!v1.3.1      info-mac            A talking Welcome to Macintosh startup
- ?                   ?                   Talking Plug-In-Module for MS Word 5,
-                                         experimental, unsupported, buggy, beware!
- Speech Rhythms      AOL                 A cool text file for one of the above apps
- __________________  __________________  _________________________________________________
- * Sources:
- + AOL = America Online
- + info-mac = {ftp sumex-aim.stanford.edu, ftp
- wuarchive.wustl.edu, et al.}
- + MacWarehouse = (800) 255-6227
- * Misc: Apple's work in spoken language technologies and systems
- is described in:
- + Lee, Kai-Fu. "The Conversational Computer: An Apple
- Perspective." (Keynote Speech) In Proc. Eurospeech in Berlin,
- September, 1993.
-
- MacinTalk
- * Platform: Macintosh
- * Cost: Free
- * Description: Formant based speech synthesis. There is also a
- program called "tex-edit" which apparently can pronounce English
- sentences reasonably using Macintalk.
- * Note: MacinTalk doesn't run reliably on Macintoshes with new
- sound hardware under the latest OS (System 7.1 w/HUD 2.0). More
- recent software is listed above.
- * Availability: By anonymous ftp from many archive sites (have a
- look on archie if you can). tex-edit is on many of the same sites.
- Try
- + ftp://wuarchive.wustl.edu/mirrors2/info-mac/Old/card/macintalk.hqx
- + ftp://wuarchive.wustl.edu/mirrors2/info-mac/Old/card/macintalk-stack.hqx
- + ftp://wuarchive.wustl.edu/mirrors2/info-mac/app/tex-edit-15.hqx
-
- Monologue by Creative Labs
- * Platform: PC Windows plus SoundBlaster 16
- * Cost: $99.00 or free with some MultiMedia packages
- * Description: Phoneme based speech synthesis software which
- provides output on Sound Blaster compatible audio cards. It
- includes a dictionary of words that are "exceptions" together with
- a dictionary manager for modifying those words. It can be used
- as a stand alone program with Windows' Clipboard or as a DDE
- server dynamically linked (DLL) to a program you write.
- * Contact:
- Creative Labs Inc.
- 1901 McCarthy Boul, Milpitas, CA 95035, USA
- Tel: 408-428-6622 Fax: 408-428-6633 BBS: 408-428-6660
- OR Creative Technology Ltd.
- 67 Ayer Rajah Crescent #03-18, Singapore 0513
- Tel: 65-870-0433 Fax: 65-773-0353 BBS: 65-776-2423
-
- Lernout & Hauspie Text-To-Speech SDK
- * Platform: IBM-Compatible
- * Description: The L&H Text-to-Speech software developers kit
- lets you integrate text-to-speech technology with your own or
- existing PC applications under Microsoft Windows 3.1. The
- software converts written text into clear, human-sounding
- synthetic speech.
- * Requirements: IBM-compatible PC 386 DX (33 MHz) or higher, 8 MB
- RAM, MS-DOS 5.0 (or higher), MS Windows 3.1 (or higher), Compiler
- and linker: Microsoft(R) Visual C++ or Borland C++, Windows(TM)
- 3.1 compatible sound card, preferably 16 bit e.g. Soundblaster,
- Windows Sound System, Pro Audio Spectrum
- * Price: Unconfirmed $1,999 per copy, and $499 per each additional
- language (American English, French, German, or Spanish).
- * Contact: USA (617) 932-4118
-
- Tinytalk
- * Platform: PC
- * Description: A shareware speech 'screen reader' package which
- is used by many blind users.
- * Availability: By anonymous ftp
- + ftp://handicap.shel.isc-br.com/speech
- Get the files ttexe166.zip and ttdoc166.zip.
-
- Narrator - narrator.device
- * Platform: Amiga
- * Description: Formant based speech synthesis. Includes an
- English-to-phoneme translation library, and a SPEAK: pseudo-device
- for speech output.
- * Hardware: Standard Amiga hardware
- * Availability: Part of AmigaOS
-
- Infovox Product Range
- * Description: Multilingual Text-to-speech systems, languages
- available: American English, British English, German, French,
- Spanish, Italian, Swedish, Norwegian, Icelandic, Danish and
- Finnish.
-
- * Product name: INFOVOX 500, PC BOARD
- + Product description: Half length expansion board for IBM PC,
- XT, AT, PS/2 model 30 or compatible personal computers. The
- board can also be connected via the serial port. Language and
- control program for downloading into RAM or mounted on
- EPROMs.
- + Platform: for IBM PC, XT, AT, PS/2 model 30 or compatible
- * Product name: INFOVOX 600, OEM BOARD
- + Product description: OEM board built with CMOS IC's. Language
- and control program are stored in on-board fixed memory.
- + Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600
- Baud
- * Product name: INFOVOX 700, DESKTOP UNIT
- + Product description: Desktop unit with built in Infovox 600
- to be connected to any computer or terminal via an RS 232-C
- serial interface. Built in loudspeaker and rechargeable
- battery for 4 hours use, and control knobs for continuous
- control of speech volume and speed.
- + Platform: any
- * Product name: INFOVOX 650, OEM BOARD
- + Product description: OEM-board built with CMOS IC's. Language
- and control program are stored in on-board memory.
- + Platform: any, Interface: 9-pole D-SUB (RS 232-C) 300-9600
- Baud
- * Product name: INFOVOX 750, DESKTOP UNIT
- + Product description: Desktop unit with built in Infovox 650
- to be connected to any computer or terminal via an RS 232-C
- serial interface. Built in loudspeaker and rechargeable
- battery for 5 hours use, and a control knob for continuous
- control of speech volume.
- + Platform: any
- * Misc: Infovox multi-lingual Text-to-Speech Technologies can
- interface with Apple's PlainTalk System. It enables Apple
- third-party developers to write application software with synthetic
- speech output using their usual Apple PlainTalk Text-to-Speech
- interface. Software already written for the English speaking
- market using Apple PlainTalk can now be distributed worldwide,
- provided message strings are translated.
- * Contact:
- Telia Promotor Infovox AB
- TTS Sales Division
- P.O. Box 2069
- S-171 02 Solna, Sweden
- Ph: +46 8 764 35 00 Fax: +46 8 735 78 76
- email: tts-sales@infovox.se
-
- SIMTEL-20
- * The following is a list of speech related software available from
- SIMTEL-20 and its mirror sites for PCs.
- * The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil
- [192.88.110.20] Try looking at your nearest archive site first.
- [Note: problems have been reported in accessing this site - does
- anyone know a new address?]
- Directory PD1: MSDOS.VOICE
- Filename Type Length Date Description
- ==============================================
- AUTOTALK.ARC B 23618 881216 Digitized speech for the PC
- CVOICE.ARC B 21335 891113 Tells time via voice response on PC
- HEARTYPE.ARC B 10112 880422 Hear what you are typing, crude voice synth.
- HELPME2.ARC B 8031 871130 Voice cries out 'Help Me!' from PC speaker
- SAY.ARC B 20224 860330 Computer Speech - using phonemes
- SPEECH98.ZIP B 41003 910628 Build speech (voice) on PC using 98 phonemes
- TALK.ARC B 8576 861109 BASIC program to demo talking on a PC speaker
- TRAN.ARC B 39766 890715 Repeats typed text in digital voice
- VDIGIT.ZIP B 196284 901223 Toolkit: Add digitized voice to your programs
- VGREET.ARC B 45281 900117 Voice says good morning/afternoon/evening
-
- _________________________________________________________________
-
-
- ===========================================================================
-
-
- FAQ SECTION 6 - Speech Recognition
-
- Q6.1: WHAT IS SPEECH RECOGNITION?
-
- Automatic speech recognition is the process by which a computer maps
- an acoustic speech signal to text.
-
- Automatic speech understanding is the process by which a computer maps
- an acoustic speech signal to some form of abstract meaning of the
- speech.
- _________________________________________________________________
-
- Q6.2: HOW CAN I BUILD A VERY SIMPLE SPEECH RECOGNISER?
-
- Doug Danforth provides a detailed account in article 253 in the
- comp.speech archives. A summary is provided below. It is also
- available by anonymous ftp
- * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/info/DIY_SpeechRecognition
-
- QUICKY RECOGNIZER sketch:
-
- Here is a simple recognizer that should give you 85%+ recognition
- accuracy. The accuracy is a function of the words you have in your
- vocabulary. Long distinct words are easy. Short similar words are
- hard. You can get 98+% on the digits with this recognizer.
-
- Overview:
- * Find the beginning and end of the utterance.
- * Filter the raw signal into frequency bands.
- * Cut the utterance into a fixed number of segments.
- * Average data for each band in each segment.
- * Store this pattern with its name.
- * Collect training set of about 3 repetitions of each pattern
- (word).
- * Recognize unknown by comparing its pattern against all patterns in
- the training set and returning the name of the pattern closest to
- the unknown.
-
- Many variations upon the theme can be made to improve the performance.
- Try different filtering of the raw signal and different processing
- methods.
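-
- The overview steps above can be sketched in a few lines of Python.
- This is only an illustration under stated assumptions, not Doug
- Danforth's code: the energy-threshold endpoint detector, the four
- band centre frequencies, the single DFT-style correlation used in
- place of a real band-pass filter bank, the loudness normalisation
- and the synth_word() test signal are all choices made here for
- brevity.

```python
import math

def find_endpoints(samples, frame=80, ratio=0.1):
    """Step 1: locate the utterance with a crude frame-energy threshold."""
    energies = [sum(x * x for x in samples[i:i + frame])
                for i in range(0, len(samples) - frame + 1, frame)]
    limit = ratio * max(energies)
    active = [i for i, e in enumerate(energies) if e > limit]
    return active[0] * frame, (active[-1] + 1) * frame

def band_energy(chunk, freq, rate):
    """Steps 2 and 4: average energy near `freq` over a whole segment.
    One DFT-style correlation stands in for a band-pass filter (assumption)."""
    re = sum(x * math.cos(2 * math.pi * freq * n / rate) for n, x in enumerate(chunk))
    im = sum(x * math.sin(2 * math.pi * freq * n / rate) for n, x in enumerate(chunk))
    return (re * re + im * im) / max(len(chunk), 1)

def make_pattern(samples, rate=8000, bands=(300, 600, 1200, 2400), segments=8):
    """Steps 1-4: a fixed-length pattern of segments x bands values."""
    start, end = find_endpoints(samples)
    speech = samples[start:end]
    seg_len = len(speech) // segments
    pattern = []
    for s in range(segments):
        chunk = speech[s * seg_len:(s + 1) * seg_len]
        energies = [band_energy(chunk, f, rate) for f in bands]
        total = sum(energies) + 1e-12      # normalise away overall loudness
        pattern.extend(e / total for e in energies)
    return pattern

def recognise(unknown, training):
    """Step 7: return the name of the closest stored (name, pattern) pair."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda item: distance(unknown, item[1]))[0]

def synth_word(freqs, rate=8000, pad=400, amp=1.0):
    """Hypothetical test signal: a sequence of tones padded with silence."""
    tones = [amp * math.sin(2 * math.pi * f * n / rate)
             for f in freqs for n in range(800)]
    return [0.0] * pad + tones + [0.0] * pad

# Steps 5-6: store three repetitions of each "word" with its name.
training = [(name, make_pattern(synth_word(freqs, pad=pad, amp=amp)))
            for name, freqs in (("low", (300, 600)), ("high", (2400, 1200)))
            for pad, amp in ((400, 1.0), (430, 0.8), (480, 1.2))]

print(recognise(make_pattern(synth_word((300, 600), pad=450, amp=0.9)), training))
```

- With real speech you would replace synth_word() with recorded
- samples and would probably want more bands and a sturdier endpoint
- detector; the structure of the recogniser stays the same.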
-
- Q6.8 contains information on public domain speech recognition
- software: Lotec and Myers' Hidden Markov Model software.
- _________________________________________________________________
-
- Q6.3: WHAT DOES SPEAKER DEPENDENT/ADAPTIVE/INDEPENDENT MEAN?
-
- A speaker dependent system is developed to operate for a single
- speaker. These systems are usually easier to develop, cheaper to buy
- and more accurate, but not as flexible as speaker adaptive or speaker
- independent systems.
-
- A speaker independent system is developed to operate for any speaker
- of a particular type (e.g. American English). These systems are the
- most difficult to develop and the most expensive, and their accuracy
- is lower than that of speaker dependent systems. However, they are
- more flexible.
-
- A speaker adaptive system is developed to adapt its operation to the
- characteristics of new speakers. Its difficulty lies somewhere
- between speaker independent and speaker dependent systems.
- _________________________________________________________________
-
- Q6.4: WHAT DOES SMALL/MEDIUM/LARGE/VERY-LARGE VOCABULARY MEAN?
-
- The size of vocabulary of a speech recognition system affects the
- complexity, processing requirements and the accuracy of the system.
- Some applications only require a few words (e.g. numbers only), others
- require very large dictionaries (e.g. dictation machines). There are
- no established definitions, but as a rough guide:
- * small vocabulary - tens of words
- * medium vocabulary - hundreds of words
- * large vocabulary - thousands of words
- * very-large vocabulary - tens of thousands of words.
-
- _________________________________________________________________
-
- Q6.5: WHAT DOES CONTINUOUS SPEECH OR ISOLATED-WORD MEAN?
-
- An isolated-word system operates on single words at a time - requiring
- a pause between saying each word. This is the simplest form of
- recognition to perform because the end points are easier to find and
- the pronunciation of a word tends not to affect others. Thus, because the
- occurrences of words are more consistent they are easier to recognise.
-
- A continuous speech system operates on speech in which words are
- connected together, i.e. not separated by pauses. Continuous speech is
- more difficult to handle because of a variety of effects. First, it is
- difficult to find the start and end points of words. Another problem
- is "coarticulation". The production of each phoneme is affected by the
- production of surrounding phonemes, and similarly the start and
- end of words are affected by the preceding and following words. The
- recognition of continuous speech is also affected by the rate of
- speech (fast speech tends to be harder).
- _________________________________________________________________
-
- Q6.6: HOW IS SPEECH RECOGNITION PERFORMED?
-
- A wide variety of techniques are used to perform speech recognition,
- and there are many types and levels of speech recognition, analysis
- and understanding.
-
- Typically speech recognition starts with the digital sampling of
- speech. The next stage is acoustic signal processing. Most techniques
- include spectral analysis; e.g. LPC analysis, MFCC, cochlea modelling
- and many, many more.
-
- The next stage is recognition of phonemes, groups of phonemes and
- words. This stage can be achieved by many processes such as DTW
- (Dynamic Time Warping), HMM (hidden Markov modelling), NNs (Neural
- Networks), expert systems and combinations of techniques. HMM-based
- systems are currently the most commonly used and most successful
- approach.
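-
- As an illustration of the simplest of these techniques, here is a
- minimal sketch of Dynamic Time Warping. It assumes each frame is a
- single number and uses the absolute difference as the local cost;
- a real recogniser would compare feature vectors (e.g. LPC or MFCC
- frames) and usually constrains the warping path. The template
- values are made up for the example.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best time alignment between sequences a and b."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]

# Hypothetical one-dimensional "templates" for two words.
templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1, 5]}
spoken = [1, 1, 3, 5, 5, 3, 1]   # "yes" spoken more slowly

best = min(templates, key=lambda name: dtw_distance(spoken, templates[name]))
print(best)
```

- The slower utterance still matches its template because DTW lets
- one sequence repeat frames while the other advances.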
-
- Most systems utilise some knowledge of the language to aid the
- recognition process.
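-
- One common form of such language knowledge is a statistical n-gram
- model, which estimates how likely a word is given the words before
- it. The sketch below is a bare maximum-likelihood trigram model
- built from a made-up four-sentence corpus; the function names are
- illustrative, and practical systems add smoothing for word
- sequences never seen in training.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count word triples and their two-word contexts."""
    tri, ctx = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
            ctx[tuple(words[i:i + 2])] += 1
    return tri, ctx

def trigram_prob(model, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1 w2); 0.0 for an unseen context."""
    tri, ctx = model
    seen = ctx[(w1, w2)]
    return tri[(w1, w2, w3)] / seen if seen else 0.0

def pick_word(model, w1, w2, candidates):
    """Choose among acoustically confusable candidates using the model."""
    return max(candidates, key=lambda w: trigram_prob(model, w1, w2, w))

model = train_trigrams([
    "please call home now",
    "please call home later",
    "please call home now",
    "please call him now",
])

print(trigram_prob(model, "please", "call", "home"))
print(pick_word(model, "please", "call", ["hum", "home"]))
```

- A recogniser would combine such a score with the acoustic score of
- each candidate rather than use it alone.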
-
- Some systems try to "understand" speech. That is, they try to convert
- the words into a representation of what the speaker intended to mean
- or achieve by what they said.
- _________________________________________________________________
-
- Q6.7: WHAT ARE SOME GOOD REFERENCES/BOOKS ON SPEECH RECOGNITION?
-
- Some reviews of speech recognition for personal computers:
- * "Seybold Report on Desktop Publishing" published a nine-page,
- head-to-head comparison of Dragon's DOS software with IBM's OS/2
- software. March 7, 1994; Volume 8, Number 7; Pages 3-11;
- ISSN:0889-9762; Seybold Publications, P.O. Box 644, Media, PA
- 19063 USA, phone (610) 565-2480.
- * McGraw-Hill Inc.'s "BYTE, the Magazine of Technology Integration,"
- published a two-page review of IBM's Personal Dictation System
- software. May 1994; Volume ?, Number ?; Pages 145-146;
- ISSN:0360-5280; Editorial, Executive, and Circulation address: One
- Phoenix Mill Lane, Peterborough, NH 03458 USA, phone ?
-
- Some general introduction books on speech recognition technology:
- * Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang
- Juang Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing
- Series), c1993 ISBN 0-13-015157-2
- * Speech recognition by machine; W.A. Ainsworth London: Peregrinus
- for the Institution of Electrical Engineers, c1988
- * Speech synthesis and recognition; J.N. Holmes Wokingham: Van
- Nostrand Reinhold, c1988
- * Douglas O'Shaughnessy -- Speech Communication: Human and Machine
- Addison Wesley series in Electrical Engineering: Digital Signal
- Processing, 1987.
- * Electronic speech recognition: techniques, technology and
- applications edited by Geoff Bristow, London: Collins, 1986
- * Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu
- Lee. San Mateo: Morgan Kaufmann, c1990
-
- More specific books/articles:
- * Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki,
- M.A. Jack. Edinburgh: Edinburgh University Press, c1990
- * Automatic speech recognition: the development of the SPHINX
- system; by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989
- * Prosody and speech recognition; Alex Waibel (Pitman: London)
- (Morgan Kaufmann: San Mateo, Calif) 1988
- * S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction
- to the Application of the Theory of Probabilistic Functions of a
- Markov Process to Automatic Speech Recognition" in Bell Syst.
- Tech. Jnl. v62(4), pp1035--1074, April 1983
- * R. P. Lippmann, "Review of Neural Networks for Speech
- Recognition", in Neural Computation, v1(1), pp 1-38, 1989.
-
- _________________________________________________________________
-
- Q6.8: WHAT SPEECH RECOGNITION PACKAGES ARE AVAILABLE?
-
- The following packages are presented in no particular order.
-
- HM2007 - Speech Recognition Chip
- * Description: HM2007 is a 48-pin single chip CMOS voice
- recognition LSI circuit with on-chip analog front end, voice
- analysis, recognition process and system control functions. A 40
- word isolated-word voice recognition system can be composed of an
- external microphone, keyboard, SRAM and a few other components.
- When combined with a microprocessor, an intelligent recognition
- system can be built. A demo board for this chip is being
- distributed by The Summa Group.
- * Cost: Approx US$30 for the HM2007 and US$100 for the demo board.
- * Warning: Several people have reported problems in obtaining
- small numbers of this chip (say less than 10). It appears that the
- distributors (including the one listed below) are only interested in
- large volumes. If you know of a good source please send it in for
- inclusion in the FAQ.
- * Contact:
- The Summa Group Limited
- One California Street, Suite #1940,
- San Francisco, CA 94111
- Ph: (415) 288-0390
-
- Voice Blaster Ver. 4.0
- * Platform: IBM AT or higher, DOS or Windows 3.1
- * Description: Uses a Sound Blaster or compatible board. Contains
- a microphone headset and a connector for LPT1:. A printer can
- still be used on LPT1:. Will recognize 1024 words that are trained
- by the operator. Each word activates a macro that can enter an
- ascii word on the screen or into a word processor or invoke a
- batch file. An optional footswitch may be installed. Software to
- run under DOS or Windows 3.1 is included.
- * Cost: Around $150 Canadian.
- * Contact:
- COVOX Inc.
- 675 Conger Street
- Eugene, Oregon, 97402, USA
- Ph: (503) 342-1271 Fax: (503) 342-1283
- BBS: (503) 342-4135
-
- Votan
- * Platform: MS-DOS, SCO UNIX
- * Description: Isolated word and continuous speech modes, speaker
- dependent and (limited) speaker independent. Vocab size is 255
- words or up to a fixed memory limit - but it is possible to
- dynamically load different words for effectively unlimited number
- of words.
- * Rough Cost: Approx US $1,000-$1,500
- * Requirements: Cost includes one Votan Voice Recognition ISA-bus
- board for 386/486-based machines. A software development system is
- also available for DOS and Unix.
- * Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice
- users. A telephone interface is also available. There is also a
- 4GL and a software development system. Apparently there is more
- than one version - more info required.
- * Contact: 800-877-4756, 510-426-5600
-
- Entropic's HTK (HMM Toolkit)
- * Platform: Range of Unix platforms.
- * Description: HTK is a software toolkit for building continuous
- density HMM based speech recognisers. It consists of a number of
- library modules and a number of tools. Functions include speech
- analysis, training tools, recognition tools, results analysis, and
- an interactive tool for speech labelling. Many standard forms of
- continuous density HMM are possible. Can perform isolated word or
- connected word speech recognition. It can model whole words or
- sub-word units. Can perform speaker verification and other pattern
- recognition work using HMMs. HTK is now integrated with the
- ESPS/Waves speech research environment which is described in
- Section 1.8.
- * Misc: The availability of HTK changed in early 1993 when
- Entropic obtained exclusive marketing rights to HTK from the
- developers at Cambridge.
- * Cost: On request.
- * Contact:
- Entropic Research Laboratory,
- 600 Pennsylvania Ave, S.E. Suite 202,
- Washington, D.C. 20003, USA
- Phone: (202) 547-1420.
- email - info@entropic.com
-
- DragonDictate version 3.0
- * Platform: PC
- * Description: Speaker-adaptive recognition system for discrete
- speech. Provides 110,000 word dictionary and also allows user to
- add words. Active vocabulary of 5,000, 30,000, or 60,000 words.
- Allows dictation into almost all DOS applications (word
- processors, spreadsheets, etc.) and hands-free operation of the
- PC.
- * Cost: Prices including audio board and high-quality headset
- microphone:
- + US$695 (5,000 word Starter Edition)
- + US$995 (30,000 word Classic Edition)
- + US$1,995 (60,000 word Power Edition)
- * Requirements: Minimum of 33 Mhz 486 with 8-16M memory and at
- least 29M disk space (depending on product), one 8-bit slot, DOS
- 5.0 and up (also runs in a DOS box under Windows or OS/2).
- * Contact:
- Dragon Systems, Inc.
- 320 Nevada Street
- Newton, MA 02160, USA
- Tel: 1-617-965-5200, Fax: 1-617-527-0372
-
- DragonDictate for Windows
- * Platform: PC
- * Description: Speech-to-text dictation system. Discrete speech;
- speaker- adaptive. Also provides command/control and mouse
- movement for hands-free operation of Windows. Comes with a 120,000
- word pronunciation dictionary; users can also add their own words
- or phrases. Dictate directly into any application.
- * Rough Cost: Prices including software, documentation and
- microphone:
- + DragonDictate Starter Edition (5,000 words active) -- $395
- + DragonDictate Classic Edition (30,000 words active) -- $695
- + DragonDictate Power Edition (60,000 words active) -- $1,695
- * Requirements: 486/33, 7-10 MB dedicated RAM (depending on
- edition), Windows 3.1 or later. Supported sound boards: Media
- Vision Pro Audio Studio 16, Creative Labs Sound Blaster 16,
- Microsoft Windows Sound System, IBM Audio Capture/Playback
- Adapter.
- * Contact:
- Dragon Systems, Inc.
- 320 Nevada Street
- Newton, MA 02160, USA
- Phone: (617)965-5200 Fax: (617)527-0372
-
- DragonVoiceTools
- * Platform: PC
- * Description: Programmer's toolkit for developing speech-aware
- DOS or Windows applications. Recognizes continuously spoken digits
- and discretely spoken words or phrases. Up to 1,000 words can be
- active at one time. Use words from 110,000 word dictionary
- (included) and/or develop your own word models.
- * Cost:
- + US$1,995 (developer's kit)
- + US$595 (end-user system)
- * Requirements: Minimum of 20 Mhz 386 (larger vocabulary requires
- faster processor) with at least 5M memory and at least 19M disk
- space (depending on vocabulary size), DOS 5.0 and up, Windows 3.1
- and up, Borland C or C++ or Microsoft C or C++. Also requires IBM
- M-ACPA card available from IBM or Dragon Systems ($325).
- * Contact:
- Dragon Systems, Inc.
- 320 Nevada Street,
- Newton, MA 02160, USA
- Tel: 1-617-965-5200, Fax: 1-617-527-0372
-
- IBM VoiceType Dictation
-
- OR: Osborne Personal Dictation System (in Australia)
- * Platform: Intel I486 & IBM OS/2
- * Description: Speaker independent, discrete speech dictation with
- navigation. Navigation does not require setup, most applications
- are automatically speech enabled by dynamic control analysis.
- Dictation averages 70WPM with 95% accuracy and uses statistical
- trigram modelling. The base system is 22K words, other
- vocabularies available for specific industries.
- * Requirements: 486SX or above, 16MB Ram, 30MB File space,
- Dictation Adapter
- * Cost: Software $495 (includes mic) / Hardware $495
- * Misc 1: A Windows version is now available.
- * Misc 2: Based on IBM Tangora Technology
- * Availability: US English. Other languages (UK, FR, GR, IT, and
- ES) available 3Q94.
- * Contact: US Contact 1-800-TALK-2-ME or 1-914-766-9252.
-
- VoiceServer for Windows
- * Platform: PC
- * Description: Speaker dependent, each with an independent
- directory. Isolated word. Up to 1000 words/user, 300 words/window.
- 1 word occupies 2Kb on hard disk. Can be used to control Windows
- applications by issuing voice commands instead of menu selection.
- * Rough Cost: 292 Pounds(UK)
- * Requirements: None
- * Misc: Price includes a half-sized AT voice card (including a
- DSP), software, documentation & a microphone (attachable to
- keyboard or speaker). A light-weight high-spec headset is an
- optional extra.
- * Contact:
- Mark Redwood
- Applied Voice Technologies
- 26 Danbury Street, Islington,
- London, UK, N1 8JU
- Ph: + 44 71 454 1224 : Fax: + 44 71 454 1225
-
- IN3 Voice Command for Windows
- * Platform: PC with Windows 3.1
- * Description: IN3 is now available for MS-Windows. Users can call
- applications to the foreground with voice commands. Once the
- application is called, the user may enter commands and data with
- voice commands. Voice macros can reduce the strain of repetitive
- stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by
- replacing heavy repetitive keyboard hammering with simple voice
- operations. Voice macros take complex operations and reduce them
- to simple verbal commands. Voice input can provide new facilities
- for tasks which could not easily have been otherwise performed
- without the multiple axis of input. IN3 is hardware-independent:
- users with any Windows-compatible audio system can add speech recognition
- to the desktop. IN3 works with either 8 bit or 16 bit Windows audio
- boards. IN3 is based on continuous word-spotting technology. A
- developer API is also available for creating voice-enabled
- applications.
- * Price: $179 U.S.
- * Requirements: PC with 80386 processor or better, Microsoft
- Windows 3.1, and Windows compatible audio system with microphone.
- * Misc: Fully functional demos are available on Compuserve in
- various Multimedia and CAD forums. Demos are also available from
- "America on Line", the comp.binaries.ms-windows archive sites, and
- various BBS systems. It is also available by anonymous ftp
- + ftp://ftp.wustl.edu/usenet/comp.binaries.ms-windows/v3/in3demo.zip
- + ftp://ftp.uwasa.fi/mirror/ultrasound/demo/in3demo.zip
- An equivalent Sun product is described below.
- * Contact:
- Brantley Kelly
- Email: cbk@gacc.atl.ga.us CIS: 75120,431
- FAX: 1-404-925-7924 Phone: 1-404-925-7950
- Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
-
- IN3 Voice Command
- * Platform: Sun SPARCstation
- * Description: IN3 provides a secure, robust, word spotting,
- continuous speech recognition facility for the Sun OS or Solaris
- operating systems. The recognition system is a secure operating
- system facility capable of working with various interfaces,
- microphones, and devices. The operating system interface works
- with native UNIX outside of X Windows as well as provides enhanced
- X Windows facilities including named window support. The user
- interface provides a means to quickly create commands on the fly
- for replacing long strings and complex operations with voice
- macros. [Voice macros can reduce the strain of repetitive stress
- injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by replacing
- heavy repetitive keyboard hammering with simple voice operations.]
- The IN3 user interface works with generic X servers and window
- managers. A developer API is also available for creating
- voice-enabled applications, interfacing with other audio sources, and
- providing extensive application control over the recognition
- facility.
- * Availability: SunSite archive at SunSITE.unc.edu as well as on
- Catalyst CDware as both a runnable demo and unlockable software.
- * Hardware Required: Sun SPARCstation with audio input. Noise
- canceling microphone recommended but not required.
- * Software Required:
- + Sun OS 4.1.2 with OpenWindows 3.0
- + or, Sun OS 4.1.3
- + or, Solaris 2.1 or Solaris 2.2
- * Misc: An equivalent MS-Windows product is described above.
- * Price: $495 U.S.
- * Contact:
- Brantley Kelly
- Email: cbk@gacc.atl.ga.us CIS: 75120,431
- FAX: 1-404-925-7924 Phone: 1-404-813-8030
- Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
-
- Phonetic Engine 400 (PE400) - Speech Systems, Inc.
- * Platform: PC
- * Description: Speaker independent, large vocabulary, continuous
- speech recognition for MS Windows or DOS.
- * Rough Cost: $1195 US dollars. Includes board, microphone,
- developer kit, documentation, 2 days of technical training and 90
- days of technical support.
- * Requirements: IBM AT class machine or better plus 5M disk space.
- Most processing is performed on-board (4M standard or 16M
- upgrade).
- * Misc: Requires developer to provide a context-free grammar.
- Vocabulary size unknown (quotes from 500 - 2000 words per
- grammar), but dynamic grammar switching capabilities may increase
- the effective vocabulary size. Development system includes
- lower-level C,C++ library (VoiceLib), higher-level DLL (SPOT)
- callable from many languages, SPOT/VBX, a custom control for
- Visual Basic and Visual C++.
- * Contact:
- Speech Systems, Inc.
- 2945 Center Green Court South
- Boulder, CO 80301-2275, USA
- Tel: 303.938.1110 Fax: 303.938.1874
-
- SayIt
- * Platform: Sun SPARCstation
- * Description: Voice recognition and macro building package for
- Suns in the Openwindows 3.0 environment. Speaker dependent
- discrete speech recognition. Vocabularies can be associated to
- applications and the active vocabulary follows the application
- that has input focus. Macros can include mouse commands,
- keystrokes, Unix commands, sound, Openwindow actions and more. An
- evaluation copy is available by email.
- * Hardware: Microphone required (SunMicrophone is fine).
- * Cost: $US295
- * Contact:
- Phone: 1-800-245-UNIX or 1-415-572-0200
- Fax: 1-415-572-1300
- Email: info@qualix.com
-
- Kurzweil Voice for Windows
- * Platform: MS Windows 3.1
- * Description: Kurzweil Voice for Windows is a dictation product
- enabling the user to create text and enter data by speaking to
- Windows-based applications. System is adaptive but requires no
- initial training. Users can choose either 30,000 or 60,000 word
- active vocabulary. Includes application command translation
- templates for popular Windows applications such as WordPerfect,
- 1-2-3, Organizer and Word.
- * Cost: US $995
- * Hardware: 486DX/33 or higher, 8 or 16 MB dedicated memory
- (depends on vocabulary), 30 MB dedicated disk space, VGA or
- higher, Kurzweil-supplied microphone and DSP board.
- * Contact:
- Phone: 1-800-380-1234
- Email: info@kurz-ai.com
-
- D6006 Voice Control Processor
- * Platform: ?
- * Description: ?
- * Contact:
- DSP Telecommunications Inc.
- 2855 Kifer Road, Suite 202, Santa Clara CA 95051, USA
- Tel:(408)986-4310
- Fax:(408)986-4324
-
- Speech Commander - Listen for Windows
- * Platform: ?
- * Description: ?
- * Contact:
- Verbex Voice Systems
- 1090 King Georges Post Rd., Bldg 107,
- Edison NJ 08837, USA
- Tel:(908)225-5225
- Fax:(908)225-7764
-
- Voice-Trek 2.0
- * Platform: ?
- * Description: ?
- * Contact:
- Tardis Technology Inc., Voice Recognition Div.
- 10321 Los Alamitos Blvd., Los Alamitos CA 90720
- Tel:(310)799-3355 Fax:(310)799-3360
-
- Visus SpeechKit
- * Platform: NeXT
- * Description: SpeechKit is based on SPHINX, a
- speaker-independent, 1000 word or so, continuous speech
- recognition system which allows you to incorporate speech
- recognition into your applications. You can design your vocabulary
- and grammars.
- * Contact: Visus - no address or phone provided. A possible
- contact is Robert Brennan at Carnegie Mellon University. email:
- Robert_Brennan@cmu.edu
-
- recnet
- * Platform: UNIX
- * Description: Speech recognition for the speaker independent
- TIMIT and Resource Management tasks. It uses recurrent networks to
- estimate phone probabilities and Markov models to find the most
- probable sequence of phones or words. The system is a snapshot of
- evolving research code. There is no documentation other than
- published research papers. The components are:
- + A preprocessor which implements many standard and many
- non-standard front end processing techniques.
- + A recurrent net recogniser and parameter files
- + Two Markov model based recognisers, one for phone recognition
- and one for word recognition
- + A dynamic programming scoring package
- The complete system performs competitively.
- * Cost: Free
- * Requirements: TIMIT and Resource Management databases
- * Contact: Tony Robinson: ajr@eng.cam.ac.uk
- * Availability: by anonymous ftp
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/misc/recnet-1.3.tar.Z
-
- Lotec Speech Recognition Package
- * Platform: Sun
- * Description: Public domain speech recognition software. Operates
- from input in Sun audio format (.au files) and outputs word
- hypotheses and time labelling data. The software includes programs
- to collect speech samples, a labeller, a "featurizer" which
- parameterises speech files, a word spotter and the recogniser. The
- software can perform real time recognition on a Sparc 10 for small
- vocabularies.
- * Requirements: Sun SPARC audio input and a "decent" microphone,
- Sun multimedia demo software (in /usr/demo/SOUND) and X.
- * Availability: By anonymous ftp
- + ftp://ftp.sanpo.t.u-tokyo.ac.jp/pub/nigel/lotec/lotec.tar.Z
- * Contact: Nigel Ward: nigel@sanpo.t.u-tokyo.ac.jp
-
- Myers' Hidden Markov Model software
- * Description: Hidden Markov model software for automatic speech
- recognition. C++ code that implements a basic left-right hidden
- Markov model and corresponding Baum-Welch (ML) training algorithm.
- It is meant as an example of the HMM algorithms described by
- L.Rabiner and others. The code was built in order to learn how HMM
- systems work and we are now offering it to the net so that others
- can learn how to use HMMs for speech recognition. Keep in mind
- that ease of understanding was the primary concern, not
- efficiency. The code can be used to build an experimental speech
- recognition system using "train_hmm" and "test_hmm", and can be
- used in conjunction with written tutorials on HMMs to understand
- how they work.
- * Availability: By anonymous ftp from the comp.speech archive
- site. There are three files in the directory
- + ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/sources
- The files are
- + hmm.README
- + hmm-1.0.tar.Z
- + OR, hmm-1.0.tar.gz
- (Note: hmm-1.0.tar.Z and hmm-1.0.tar.gz are Unix-compressed and GNU
- gzip-compressed versions of the same files)
- * Contact: Richard Myers: email rmyers@ics.uci.edu
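-
- To give a flavour of what a left-right HMM computes, here is a
- minimal sketch of the forward algorithm for a discrete-observation
- model. It is not taken from the hmm-1.0 distribution (which is
- C++); the two-state model, its probabilities and the function name
- are made up for illustration, and the model is assumed always to
- start in state 0 and to be allowed to finish in any state.

```python
def forward_probability(obs, trans, emit):
    """P(obs | model) for a discrete HMM that always starts in state 0.

    trans[i][j] is the probability of moving from state i to state j;
    emit[j][k] is the probability that state j emits symbol k.
    """
    n = len(trans)
    # Initialisation: all probability mass is in state 0.
    alpha = [emit[0][obs[0]]] + [0.0] * (n - 1)
    # Induction: propagate through the transitions, then emit the symbol.
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
                 for j in range(n)]
    # Termination: sum over all states the model may end in.
    return sum(alpha)

# Left-right topology: a state may stay put or move right, never back.
TRANS = [[0.5, 0.5],
         [0.0, 1.0]]
EMIT = [[0.9, 0.1],    # state 0 mostly emits symbol 0
        [0.2, 0.8]]    # state 1 mostly emits symbol 1

print(forward_probability([0, 1], TRANS, EMIT))
```

- Recognition with whole-word models then amounts to scoring the
- observation sequence against each word's model and picking the
- highest probability; Baum-Welch training re-estimates TRANS and
- EMIT from example utterances.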
-
- Voice Command Line Interface
- * Platform: Amiga
- * Description: VCLI will execute CLI commands, ARexx commands, or
- ARexx scripts by voice command through your audio digitizer. VCLI
- allows you to launch multiple applications or control any program
- with an ARexx capability entirely by spoken voice command. VCLI is
- fully multitasking and will run in the background, continuously
- listening for your voice commands even while other programs are
- running. Documentation is provided in AmigaGuide format. VCLI 6.0
- runs under either Amiga DOS 2.0 or 3.0.
- * Cost: Free?
- * Requirements: Supports the DSS8, PerfectSound 3, Sound Master,
- Sound Magic, and Generic audio digitizers.
- * Availability: by ftp from wuarchive.wustl.edu in the file
- systems/amiga/incoming/audio/VCLI60.lha and from
- amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
- * Contact: Author's email is RHorne@cup.portal.com
-
- DATAVOX - French
- * Platform: PC
- * Description: Continuous speech - speaker independent or
- dependent.
- * Rough Cost: ?
- * Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an
- A/D - D/A module (ASA116)
- * Misc: Application software may dialog with DATAVOX through 2
- types of interfaces :
- + Keyboard overlay: The application software may be used with
- any PC compatible package. No specific adaptation is
- necessary, you only need to define your configuration with
- the application software.
- + C library: Allows a user-written program to drive the
- recognition system.
- DATAVOX is based on the AMADEUS speech recognition software developed
- at LIMSI. It provides
- + Continuous speech recognition with 500 words speaker
- dependent, 50 words speaker independent (custom-made
- vocabulary).
- + Grammar of the application language (syntax acquisition,
- verification and simplification software).
- + Large vocabulary : DATAVOX can recognize vocabularies of
- several thousand words as long as there are no more than 500
- words in the active vocabulary at any given node. It takes
- less than 1 second to change syntax and vocabulary.
- + Training controlled by the system (use of co-articulation
- models).
- + Response time less than 500 ms for any phrase length.
- + Synthesis (ADPCM) output can be heard while recognition is
- being carried out.
- * Contact:
- VECSYS
- Le Chene rond, 91570 Bievres, France
- Fax: 33 1 69 41 24 30
- Voice: 33 1 69 41 15 04
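-
- Illustration (not supplied by VECSYS): the large-vocabulary item
- above says the total vocabulary may reach several thousand words
- provided that no more than 500 words are active at any one syntax
- node. A minimal Python sketch of that constraint follows. The class
- and method names are hypothetical; this is not the DATAVOX C library
- interface, only a way to picture the per-node limit.

```python
# Hypothetical model of a per-node active-vocabulary limit.
# The 500-word figure comes from the listing; everything else is assumed.
MAX_ACTIVE_WORDS = 500


class Grammar:
    """Maps each syntax node to the set of words active at that node."""

    def __init__(self, max_active=MAX_ACTIVE_WORDS):
        self.max_active = max_active
        self.nodes = {}

    def add_node(self, name, words):
        vocab = set(words)
        if len(vocab) > self.max_active:
            raise ValueError(
                f"node {name!r} has {len(vocab)} words; "
                f"limit is {self.max_active}"
            )
        self.nodes[name] = vocab

    def active_words(self, name):
        # Only this node's words compete during recognition at this point.
        return self.nodes[name]

    def total_vocabulary(self):
        # Union over all nodes; this may far exceed the per-node limit.
        return set().union(*self.nodes.values())


g = Grammar()
g.add_node("menu", (f"menu{i}" for i in range(500)))
g.add_node("names", (f"name{i}" for i in range(500)))
g.add_node("places", (f"place{i}" for i in range(500)))
print(len(g.total_vocabulary()), "words in total")
print(len(g.active_words("names")), "words active at node 'names'")
```

- In this sketch the total vocabulary grows with the number of nodes,
- while the set searched at any moment stays within the per-node limit.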
-
- PowerSecretary
- * Platform: Macintosh Centris 650, 660AV; Quadra 650, 660AV, 700,
- 800, 840AV, 900, 950.
- * Description: Speaker dependent/adaptive system requiring words
- to be separated by short pauses.
- * Vocabulary: 30,000 at any one time, automatically selected from
- 120,000-word dictionary.
- * Cost: US$2,495; non-AV machines also need an audio board, which
- costs about US$300.
- * Requirements: Minimum of 16 MB of RAM and System 7.0.
- * Contact:
- Articulate Systems
- 600 W. Cummings Park, Suite 4500
- Woburn, MA 01801
- Ph: (617) 935-5656 Fax: (617) 935-0490.
-
- ICSS system from IBM
- * Description: A large vocabulary, speaker independent, continuous
- speech system which runs under Windows, OS/2, and AIX.
- * Requirements: Soundboard (e.g. Soundblaster)
- * Price: $US319
- * Contact:
- A&G Graphics Interface
- ICSS Reseller
- 51 Gore Street, Cambridge, MA, 02139, USA
- (617) 492-0120
-
- Custom Voice(TM) by A&G Graphics Interface
- * Description: Speech recognition custom control for Visual Basic,
- Visual C++, Borland C++, and other development platforms that
- support *.VBX. Provides a speech recognition development platform
- that is independent of any one proprietary engine. Currently
- supports ICSS, but should soon support other platforms. Includes a
- grammar debugger and parser APIs to parse recognized speech into
- useful data types.
- * Requirements: Visual Basic or any development platform that
- supports VBX.
- * Price: $US495, or $US695 bundled with ICSS.
- * Contact:
- A&G Graphics Interface
- 51 Gore Street, Cambridge, MA, 02139, USA
- (617) 492-0120
-
- Creative VoiceAssist
- * Platform: PC (?)
- * Price: $US99.95
- * Contact:
- Creative Labs
- Ph: 1-800-998-5227
-
- _________________________________________________________________
-
-
-
-
- Andrew Hunt
- ---
- Speech Technology Research Group Ph: 61-2-351 4509
- Dept. of Electrical Engineering Fax: 61-2-351 3847
- University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au
-